How to quickly calculate the maximum number of Erlang processes and schedulers that can be alive, based on machine specs


I Gusti Ngurah Oka Prinarjaya
Hi,

I'm a super newbie, and I've done some very simple parallel processing with Erlang. I'm experimenting with a database containing a few hundred thousand rows. I split the rows into different offsets and assign each worker process its own range of rows. For each row I do a simple text-similarity calculation using binary:longest_common_prefix/1 (a rough sketch of the partitioning appears after the scenarios below).

Let's assume I have 200,000 rows of data in total.
First, I create 10 worker processes and assign 20,000 rows to each.
Second, I create 20 worker processes and assign 10,000 rows to each.
Third, I create 40 worker processes and assign 5,000 rows to each.
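
Roughly, my partitioning looks like this simplified sketch (the module name, the Reference binary, and the chunking helper are just illustrative placeholders, not my real code):

    %% Split Rows into NWorkers chunks, spawn one worker per chunk, and
    %% collect each worker's list of longest-common-prefix lengths.
    -module(lcp_workers).
    -export([run/3]).

    run(Rows, Reference, NWorkers) ->
        ChunkSize = max(1, length(Rows) div NWorkers),
        Chunks = chunk(Rows, ChunkSize),
        Parent = self(),
        Pids = [spawn_link(fun() -> Parent ! {self(), work(Chunk, Reference)} end)
                || Chunk <- Chunks],
        [receive {Pid, Result} -> Result end || Pid <- Pids].

    %% Each worker compares every row in its chunk against a reference binary.
    work(Chunk, Reference) ->
        [binary:longest_common_prefix([Row, Reference]) || Row <- Chunk].

    %% Cut a list into pieces of at most N elements.
    chunk([], _N) -> [];
    chunk(Rows, N) when length(Rows) =< N -> [Rows];
    chunk(Rows, N) ->
        {Chunk, Rest} = lists:split(N, Rows),
        [Chunk | chunk(Rest, N)].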

My machine specs:
- MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports)
- Processor 3.1 GHz Intel Core i5 (2 physical cores, with HT)
- RAM 8 GB 2133 MHz LPDDR3

My questions are:

1. How can I do a quick, dumb/simple calculation of the maximum number of Erlang processes based on the machine specs above?

2. The running time of the text processing with 10, 20, or 40 workers was blazingly fast, so I cannot see the difference. How do I measure it, e.g. print the total time, so I can see the difference?

3. How many schedulers need to be active/available when I create 10 processes? Or 20? Or 40? And so on.

Please enlighten me.

Thank you super much 






Re: How to quickly calculate the maximum number of Erlang processes and schedulers that can be alive, based on machine specs

t ty
The only suggestion I have is to also run a test with 200,000 workers, i.e. spawn one process per row instead of managing the partitioning yourself.

For measuring running times you can use timer:tc/3.
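
A minimal sketch (lcp_workers:run/3 here refers to the illustrative sketch in the first post, not a real API, and Rows and Reference are assumed to be bound already):

    %% timer:tc(Module, Function, Args) returns {MicroSeconds, Result}.
    {Micros, _Results} = timer:tc(lcp_workers, run, [Rows, Reference, 20]),
    io:format("took ~p ms (~.2f minutes)~n",
              [Micros div 1000, Micros / 60000000]).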

Re: How to quickly calculate the maximum number of Erlang processes and schedulers that can be alive, based on machine specs

Dániel Szoboszlay
In reply to this post by I Gusti Ngurah Oka Prinarjaya
Your second and third questions are easy to answer: measure the execution time of functions with timer:tc, and even with a single scheduler you can run as many processes as you want. They will compete for a single core, though, and will have to wait a long time to get CPU time once scheduled out. So just stick to the default and use as many schedulers as you have cores.

Now, finding the maximum (or rather, optimal) number of processes to perform this particular task on your particular machine is hard. A very dumb calculation would be that because all of the processes will be doing the same CPU-bound task, they will all compete for the same hardware resources, so you won't gain much by having more processes than CPU cores (4 in your case). If accessing the rows involves some I/O, then you should use more processes, so that some can run the CPU-bound text calculations while others wait for I/O. Try experimenting with different numbers of processes while monitoring the scheduler utilisation (e.g. with observer): if you're much below 100% utilisation (across all schedulers), you have too few. If, on the other hand, you see the run queue going up (the number of runnable processes that are waiting for a CPU slice to run), you have too many.
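
If you would rather measure it from code than eyeball observer, a sketch along these lines works (JobFun is whatever starts your workers and waits for them; the function name is made up):

    %% Per-scheduler utilisation around one job, based on
    %% erlang:statistics(scheduler_wall_time). Returns [{SchedulerId, Fraction}].
    scheduler_utilisation(JobFun) ->
        erlang:system_flag(scheduler_wall_time, true),
        Before = lists:sort(erlang:statistics(scheduler_wall_time)),
        JobFun(),
        After = lists:sort(erlang:statistics(scheduler_wall_time)),
        erlang:system_flag(scheduler_wall_time, false),
        [{Id, (Active1 - Active0) / max(1, Total1 - Total0)}
         || {{Id, Active0, Total0}, {Id, Active1, Total1}} <- lists:zip(Before, After)].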

But you can safely use a few more processes than the minimum needed to saturate the CPU. It can even speed up the whole job a bit if not all rows take equal time to process (consider one process getting a chunk of rows that are super slow to process: at the end, all the other processes will have finished and you'll have to wait for this one big worker to do its work on a single core; having twice as many processes would cut that chunk into two halves, also halving the time you wait at the end). However, past one (hard to find) point, adding more processes will hurt performance: more processes means more cache misses and more synchronisation overhead at the beginning and end of the job.

The theoretical maximum number of processes is probably constrained by your RAM: measure how much memory one process needs and divide 8 GB (minus some for the OS and other programs) by this number. You won't be able to fit more processes in RAM, and swapping will only slow down your computation. But this limit is probably in the range of thousands of processes.
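
A rough way to get that per-process number (a sketch; the row data below is made up, use a chunk that looks like your real workload):

    %% Spawn a process holding a representative chunk and ask the VM
    %% how much memory it uses.
    Pid = spawn(fun() ->
                    Rows = lists:duplicate(5000, <<"some representative row">>),
                    receive stop -> length(Rows) end    %% keep Rows live
                end),
    timer:sleep(100),                                   %% let it build its heap
    {memory, Bytes} = erlang:process_info(Pid, memory),
    io:format("~p bytes per worker -> about ~p workers per GiB~n",
              [Bytes, (1024 * 1024 * 1024) div Bytes]),
    Pid ! stop.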

Hope this helps,
Daniel


Re: How to quickly calculate the maximum number of Erlang processes and schedulers that can be alive, based on machine specs

I Gusti Ngurah Oka Prinarjaya
Hi Dániel,

>> Try experimenting with different numbers of processes while monitoring the scheduler utilisation (e.g. with observer): if you're much below 100% utilisation (across all
>> schedulers), you have too few.
I'm lucky, I always get 100% utilisation.

>> If, on the other hand, you see the run queue going up (the number of runnable processes that are waiting for a CPU slice to run), you have too many.
Where can I see this?

Thank you :)


Re: How to quickly calculate the maximum number of Erlang processes and schedulers that can be alive, based on machine specs

Roger Lipscombe
On Sun, 14 Jul 2019 at 07:24, I Gusti Ngurah Oka Prinarjaya
<[hidden email]> wrote:
> >> If, on the other hand, you see the run queue going up (the number of runnable processes that are waiting for a CPU slice to run), you have too many.
> Where can I see this?

http://erlang.org/doc/man/erlang.html#statistics_run_queue_lengths
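
For example, sampling it a few times while the job is running (a sketch):

    %% Print the per-scheduler run queue lengths once a second for ten seconds.
    [begin
         io:format("run queues: ~p~n", [erlang:statistics(run_queue_lengths)]),
         timer:sleep(1000)
     end || _ <- lists:seq(1, 10)].
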
Re: How to quickly calculate the maximum number of Erlang processes and schedulers that can be alive, based on machine specs

Dániel Szoboszlay
The run queue is also shown on the System tab of Observer (Statistics box). Not a time chart though, just the current value.

Re: How to quickly calculate the maximum number of Erlang processes and schedulers that can be alive, based on machine specs

Jesper Louis Andersen
In reply to this post by I Gusti Ngurah Oka Prinarjaya
On Sat, Jul 13, 2019 at 10:47 AM I Gusti Ngurah Oka Prinarjaya <[hidden email]> wrote:
> I'm a super newbie, and I've done some very simple parallel processing with Erlang. I'm experimenting with a database containing a few hundred thousand rows. I split the rows into different offsets and assign each worker process its own range of rows. For each row I do a simple text-similarity calculation using binary:longest_common_prefix/1.

First, you need to recognize that you have a parallelism problem, not a concurrency problem. So you are interested in what speedup you can get by adding more cores, compared to a single-process solution. The key analysis parameters are work, span and cost[0]. On top of that, you want to look at the speedup factor (S = T_1 / T_p).


> 1. How can I do a quick, dumb/simple calculation of the maximum number of Erlang processes based on the machine specs above?


This requires measurement. A single-core/process system has certain advantages:

* It doesn't need to lock and latch.
* It doesn't need to distribute data (scatter) and recombine data (gather).

Adding more processes has an overhead, and at some point it will cease to provide speedup. In fact, speedup might go down.

What I tend to do is napkin-math the cost of a process. The PCB I usually set at 2048 bytes. It is probably lower in reality, but an upper bound is nice. If each process has to keep, say, 4096 bytes of data around, I set it at 2*4096 to account for the GC. So that is around 10 kilobytes per process. If I have a million processes, that is about 10 gigabytes of memory. If each process is also doing network I/O, you need to account for the network buffers in the kernel as well, etc. However, since you are looking at parallelism, this matters less, because you don't want to keep a process per row anyway (the overhead tends to be too big in that case, and the work is not concurrent anyway[1]).
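
In code, that napkin math is just:

    %% Assumed numbers from above: 2048-byte PCB plus 2 * 4096 bytes of
    %% per-process data (doubled to leave room for the GC).
    PerProcess = 2048 + 2 * 4096,                        %% ~10 KB per process
    Total = 1000000 * PerProcess,                        %% one million processes
    io:format("~.1f GB~n", [Total / (1024 * 1024 * 1024)]).   %% ~9.5 GB
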
> 2. The running time of the text processing with 10, 20, or 40 workers was blazingly fast, so I cannot see the difference. How do I measure it, e.g. print the total time, so I can see the difference?


timer:tc/1 is a good start. eministat[2] is a shameless plug as well.
> 3. How many schedulers need to be active/available when I create 10 processes? Or 20? Or 40? And so on.


If your machine has 2 physical cores with two hyperthreads per core, a good first ballpark is either 2 or 4 schedulers. Adding more just makes them fight for the resources. The `+stbt` option might come in handy if supported by your environment. Depending on your workload, you can expect somewhere between -30% and +50% extra performance from the additional hyperthread. In some cases it hurts performance (a quick way to experiment with the scheduler count is sketched after the list below):

* Caches can be booted out by the additional hyperthread
* If you don't have memory pressure to make a thread wait, there is little additional power in the hyperthread
* In a laptop environment, the additional hyperthread will generate more thermal heat. This might make the CPU clock down, resulting in worse run times. This is especially important on MacBooks: they have really miserable thermals and pair overly powerful CPUs with a bad thermal solution. That gives them good peak performance when "sprinting" for short bursts, but poor sustained performance on "marathons". Battery vs AC power also matters a lot and will affect run times.
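
A quick way to experiment (a sketch; JobFun stands for whatever starts your workers and waits for them):

    %% Check and change the number of online schedulers at runtime, then
    %% time the same job at each setting. You can also start the VM with
    %% e.g. `erl +S 2` instead.
    Online = erlang:system_info(schedulers_online),
    erlang:system_flag(schedulers_online, 2),
    {T2, _} = timer:tc(JobFun),
    erlang:system_flag(schedulers_online, Online),
    {TN, _} = timer:tc(JobFun),
    io:format("2 schedulers: ~p us, ~p schedulers: ~p us~n", [T2, Online, TN]).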

As for how many processes: you want enough to keep all your schedulers utilized, but not so many that your work is broken into tiny pieces. That would mean more scatter/gather I/O, impeding your performance. And if that I/O goes across CPU cores, you are also looking at waiting on caches.

If you are really interested in parallel processing, it is probably better to look at languages built for that problem space: Rust with its rayon library, or something like https://futhark-lang.org/ might be better suited. Or even look at TensorFlow; it has a really strong, optimized numerical core. Erlang, being bytecode interpreted, pays an overhead which you have to balance out with more productivity, ease of programming, faster prototyping, or the like. Erlang tends to be stronger at MIMD-style processing (and so does, e.g., Go).

[1] Your work is classical SIMD rather than MIMD.

--
J.

Re: How to quickly calculate the maximum number of Erlang processes and schedulers that can be alive, based on machine specs

I Gusti Ngurah Oka Prinarjaya
Hi Andersen,

Wow, thank you very much for the explanation.

>> First, you need to recognize that you have a parallelism problem
Yes, I need parallelism. But I don't have time to research GPU processing.

Now I know how many schedulers I need to provide.

Thank you :)



