Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu

Hi,

We are trying to understand what prevents the Erlang distribution link from saturating the network. Even though there is plenty of CPU, memory & network bandwidth, the Erlang distribution doesn't fully utilise available resources. Can you help us figure out why?

We have a 3-node Erlang 22.0.2 cluster running on Ubuntu 16.04 x86 64bit.

This is the maximum network throughput between node-a & node-b, as measured by iperf:

iperf -t 30 -c node-b
------------------------------------------------------------
Client connecting to 10.0.1.37, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  3] local 10.0.1.36 port 43576 connected with 10.0.1.37 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.0 sec  78.8 GBytes  22.6 Gbits/sec

We ran this multiple times, in different directions and with different degrees of parallelism; the maximum network throughput is roughly 22 Gbit/s.

We run the following command on node-a:

B = fun F() -> rpc:cast('foo@node-b', erlang, is_binary, [<<0:10000000/unit:8>>]), F() end.
[spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].

This is what the network reports on node-a:

dstat -n 1 10
-net/total-
 recv  send
   0     0
 676k  756M
 643k  767M
 584k  679M
 693k  777M
 648k  745M
 660k  745M
 667k  772M
 651k  709M
 675k  782M
 688k  819M

That roughly translates to 6 Gbit/s. In other words, the Erlang distribution link between node-a & node-b is maxing out at ~6 Gbit/s, about 27% of what we measure consistently and repeatedly outside of Erlang. Put differently, iperf is roughly 3.6x faster than an Erlang distribution link. It gets better.
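(For the arithmetic: dstat shows ~750 MB/s on the send side, and 750 MB/s x 8 is ~6 Gbit/s, assuming dstat is reporting decimal megabytes.)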

If we start another 100 processes pumping 10Mbyte messages from node-a to node-c, we see the network throughput double:

dstat -n 1 10
-net/total-
 recv  send
   0     0
1303k 1463M
1248k 1360M
1332k 1458M
1480k 1569M
1339k 1455M
1413k 1494M
1395k 1431M
1359k 1514M
1438k 1564M
1379k 1489M

So 2 distribution links - each to a separate node - utilise 12 Gbit/s out of the 22 Gbit/s available on node-a.

What is preventing the distribution links from pushing more data through? There is plenty of CPU & memory available (all nodes have 16 CPUs & 104GB MEM - n1-highmem-16):

dstat -cm 1 10
----total-cpu-usage---- ------memory-usage-----
usr sys idl wai hiq siq| used  buff  cach  free
 10   6  84   0   0   1|16.3G  118M  284M 85.6G
 20   6  73   0   0   1|16.3G  118M  284M 85.6G
 20   6  74   0   0   0|16.3G  118M  284M 85.6G
 18   6  76   0   0   0|16.4G  118M  284M 85.5G
 19   6  74   0   0   1|16.4G  118M  284M 85.4G
 17   4  78   0   0   0|16.5G  118M  284M 85.4G
 20   6  74   0   0   0|16.5G  118M  284M 85.4G
 19   6  74   0   0   0|16.5G  118M  284M 85.4G
 19   5  76   0   0   1|16.5G  118M  284M 85.4G
 18   6  75   0   0   0|16.5G  118M  284M 85.4G
 18   6  75   0   0   0|16.6G  118M  284M 85.3G

The only smoking gun is the distribution output queue buffer: https://grafana.gcp.rabbitmq.com/dashboard/snapshot/H329EfN3SFhsveA20ei7jC7JMFHAm8Ru?orgId=1&fullscreen&panelId=62
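For reference, one way to sample that queue directly in a shell on the node (assuming the default TCP distribution carrier, where the controlling entity returned by dist_ctrl is a port; with TLS distribution it is a pid):

%% Bytes currently queued in each distribution connection's driver queue.
[{Node, erlang:port_info(Port, queue_size)}
 || {Node, Port} <- erlang:system_info(dist_ctrl), is_port(Port)].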

Speaking of which, we look forward to erlang/otp#2270 being merged: https://github.com/erlang/otp/pull/2270

All distribution metrics are available here: https://grafana.gcp.rabbitmq.com/dashboard/snapshot/H329EfN3SFhsveA20ei7jC7JMFHAm8Ru?orgId=1

If you want to see the state of distribution links & dist process state (they are all green btw), check the point-in-time metrics (they will expire in 15 days from today): https://grafana.gcp.rabbitmq.com/d/d-SFCCmZz/erlang-distribution?from=1560775955127&to=1560779424482

How can we tell what is preventing the distribution link from using all available bandwidth?

Are we missing a configuration flag? These are all the relevant beam.smp flags that we are using: https://github.com/erlang/otp/pull/2270#issuecomment-500953352


Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Dmytro Lytovchenko
I believe the Erlang distribution is the wrong thing to use if you want to saturate the network.
There is plenty of overhead for each message: the data gets copied, then encoded (copied again), then sent, then received (copied), then decoded (copied again), and finally handed to the destination process (copied again). The receiving processes might also be slow to fetch the incoming data; they aren't running in hard real time and sometimes go to sleep.


I remember there were suggestions to use regular TCP connections instead; consider a user-mode driver (kernel calls have a cost) and a low-level NIF driver for that, with the intent of getting the highest gigabits your hardware can deliver.
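A minimal sketch of what that could look like with plain gen_tcp, just to get a baseline to compare against the dist link (the host name, port and options are placeholders, and there is no error handling):

%% On the receiving node: accept one connection and drain it.
{ok, L} = gen_tcp:listen(9999, [binary, {active, false}, {reuseaddr, true}]),
{ok, S} = gen_tcp:accept(L),
Drain = fun Loop() ->
            case gen_tcp:recv(S, 0) of
                {ok, _Bin} -> Loop();
                {error, closed} -> ok
            end
        end,
Drain().

%% On the sending node: push a 10 MB binary in a tight loop.
{ok, C} = gen_tcp:connect("node-b", 9999, [binary, {active, false}, {nodelay, true}]),
Payload = <<0:10000000/unit:8>>,
Send = fun Loop() -> ok = gen_tcp:send(C, Payload), Loop() end,
Send().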

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu
I wouldn't expect the Erlang distribution to reach the same network performance as iperf, but I would expect it to be within 70%-80% of maximum.

In our measurements it reaches only 27% of the maximum, which makes me believe that something is misconfigured or inefficient.

The goal is to figure out which component/components are responsible for this significant network throughput loss.

Thanks for the quick response!

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Dmytro Lytovchenko
Consider that the VM copies the data at least 4 or 5 times, and compare that against the memory bandwidth (Gbit/s) of your RAM on both servers too.
Plus the occasional garbage collection, which can be minimal if your VM memory footprint is small and you're only doing this perf testing.

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu
Are you saying that due to the overhead associated with sending data over the distribution we are hitting the distribution link limit? We have measured this to be 6 Gbit/s. Can others confirm?

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Max Lapshin-2
In reply to this post by Dmytro Lytovchenko
> Consider that the VM copies the data at least 4 or 5 times

This is not good.

I'm designing a multi-node hardware appliance for our Flussonic and thought that it would be a good idea to interconnect nodes via Erlang distribution.

It seems it would be better to create two channels: control and data?


Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Scott Lystig Fritchie
I've not tried using Partisan for the kind of single sender -> single receiver large blob throughput that y'all are attempting, but Partisan might make better use of high speed local networks in your case.


-Scott (standing in for Chris Meiklejohn ^_^)


Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Karl Nilsson-2
I work with Gerhard (OP) and just thought I'd chime in.

Thanks for the alternative suggestions of using sockets / Partisan. At some point we will look at those, but for now we are working on making the best possible use of vanilla distributed Erlang links. We do understand there is overhead and there are inherent inefficiencies in using distributed Erlang for data transfer, but isn't it surprising/concerning to anyone else that it can only make use of 27% of the total network potential? Would you not prefer it to be, say, 50% or even 75%, if that is theoretically possible?

So what we'd like to find out is whether there are genuinely insurmountable reasons for this, or whether something can be improved, either by tweaking configuration or by optimisations in the distribution layer.

We will run more tests to get more data to reason about. If anyone has suggestions for good tests to run or metrics to inspect, that would be very welcome.

Cheers
Karl

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Dmytro Lytovchenko
Have a look at your
  • Linux kernel TCP settings
  • Memory throughput in Gbit/s (the data is copied several times for sending and receiving)
  • Head-of-line blocking (if the receiver doesn't read fast enough, memory will be used to store the unconsumed data)
  • Message size - OTP 22 splits messages over 64 KB into fragments, so they also have to be buffered and reassembled at the destination
  • Kernel CPU time vs user CPU time (network driver and kernel overhead) - see the sketch after this list
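For the last point, microstate accounting (from runtime_tools, which ships with OTP) gives a rough view of where the emulator threads themselves spend their time - emulator code, port I/O, GC, sleeping - which helps separate VM overhead from kernel/network overhead; a minimal sketch:

msacc:start(1000),   %% reset counters and collect for 1 second
msacc:print().       %% print the per-thread-type / per-state breakdown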

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Lukas Larsson-8
In reply to this post by Gerhard Lazu


On Mon, Jun 17, 2019 at 4:49 PM Gerhard Lazu <[hidden email]> wrote:


B = fun F() -> rpc:cast('foo@node-b', erlang, is_binary, [<<0:10000000/unit:8>>]), F() end.
[spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].

I wrote this code, which is better able to saturate the network. Using rpc will create bottlenecks on the receiving side, as it is the receiving process that does the decode. I get about 20% more traffic through when I do that, which is not as much as I was expecting.

-module(dist_perf).

-export([go/0,go/1]).

go() ->
    go(100).
go(N) ->
    %% Print send/receive throughput every 5 seconds.
    spawn_link(fun stats/0),
    RemoteNode = 'bar@elxd3291v0k',
    %% Start N echo processes on the remote node: each receives a message
    %% and replies to the sender with is_binary(Msg) (i.e. true).
    Pids = [spawn(RemoteNode, fun F() -> receive {From, Msg} -> From ! is_binary(Msg), F() end end)
            || _<- lists:seq(1,N)],
    Payload = <<0:10000000/unit:8>>,
    %% Start N local senders, each ping-ponging the 10 MB payload with its
    %% remote peer so the dist link is kept busy.
    B = fun F(Pid) -> Pid ! {self(),Payload}, receive true -> F(Pid) end end,
    [spawn(fun() -> B(Pid) end) || Pid <- Pids ].


%% Report how many MB/s this node has sent and received through ports
%% (including the distribution connection) over the last 5 seconds.
stats() ->
    {{_input,T0Inp},{_output,T0Out}} = erlang:statistics(io),
    timer:sleep(5000),
    {{_input,T1Inp},{_output,T1Out}} = erlang:statistics(io),
    io:format("Sent ~pMB/s, Recv ~pMB/s~n",
              [(T1Out - T0Out) / 1024 / 1024 / 5,
               (T1Inp - T0Inp) / 1024 / 1024 / 5]),
    stats().
    
I fired up Linux perf to have a look at what was taking time, and it is (as Dmytro said) the copying of data that takes time. There is one specific place that your test hits very badly: https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.c#L1565

This is the re-assembly of fragmented distribution messages. It happens because the messages sent are > 64 KB. I changed that limit to 64 MB on my machine and that doubled the throughput (https://github.com/erlang/otp/blob/master/erts/emulator/beam/dist.h#L150).

The reason why the re-assembly has to be done is that the emulator does not have a variant of erlang:binary_to_term/1 that takes an iovector as input. This could definitely be fixed and is something that we are planning on fixing, though it is not something we are working on right now. However, it will only change things in production if you send > 64KB messages.
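To put numbers on that: at the default fragment size of roughly 64 KB, the 10 MB payload used in the test above turns into roughly 150 fragments per message, all of which have to be buffered and copied back together before binary_to_term can run (the exact constant lives in dist.h):

1> 10000000 div (64 * 1024).
152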

Also, just to be clear, the copying that is done by the distribution layer is:

1) use term_to_binary to create a buffer
2) use writev in inet_driver to copy buffer to kernel
3) use recv to copy data from kernel into a refc binary
4) if needed, use memcpy re-assemble fragments
5) use binary_to_term to decode term

(contrary to what Dmytro said, there is no copy to the destination process)

Before OTP-22 there was an additional copy at the receiving side, but that has been removed. So imo there is only one copy that could potentially be removed.
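To get a rough feel for how much of the per-message cost is just steps 1 and 5 (encode and decode), something like the following in a shell gives ballpark numbers; the timings are in microseconds and are illustrative only:

Payload = <<0:10000000/unit:8>>,
{EncodeUs, Bin} = timer:tc(erlang, term_to_binary, [Payload]),
{DecodeUs, _Term} = timer:tc(erlang, binary_to_term, [Bin]),
{EncodeUs, DecodeUs}.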

If you want to decrease the overhead of saturating the network with many small messages, I would implement something like read_packets for TCP in the inet_driver. However, the real bottleneck quickly becomes memory allocations, as each message needs two or three allocations, and when you receive a lot of messages those allocations add up.

Lukas

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Jesper Louis Andersen-2
In reply to this post by Gerhard Lazu
You need to instrument. Try running something like perf(1) on the emulator and look at where it is spending its time. This will either confirm or reject Dmytro's viewpoint that you are looking at excessive copying.



Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu
In reply to this post by Lukas Larsson-8
Some great replies, thank you all.

I'm putting it all together now and re-running on my end.

Will update as I capture various observations.

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu
In reply to this post by Lukas Larsson-8
TL;DR A single distribution link was measured to peak at 14.4 Gbit/s, which is ~50% of the network maximum of 28.5 Gbit/s.

I'm confirming Lukas' observations: using smaller payload sizes can result in up to 14.4 Gbit/s throughput per distribution link.

Payloads of both 60KB & 120KB yield the same max throughput of 14.4 Gbit/s.

When message payloads go from 1MB to 4MB, throughput drops from 12.8 Gbit/s to 9.6 Gbit/s.

With 10MB payloads, distribution link maxes out at 7.6 Gbit/s. This is interesting because yesterday I couldn't get it above 6 Gbit/s.
Since this is GCP, on paper each CPU gets 2 Gbit/s. At 16 CPUs we should be maxing out at 32 Gbit/s.
I have just benchmarked with iperf, and today these fresh VMs are peaking at 28.5 Gbit/s (yesterday they were 22.5 Gbit/s).

At 100MB payloads, the distribution link maxes out at 4.5 Gbit/s.

All distribution-related metrics, including annotations for various message payloads can be found here: https://grafana.gcp.rabbitmq.com/dashboard/snapshot/SDr2EWgaD2KOX5i0H154UvzoHRS68FV0?orgId=1

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Lukas Larsson-8
You made me curious so I did some additional optimizations: https://github.com/erlang/otp/pull/2291

They added an additional 20-25% for me.

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu
Someone's on a roll today!

Pulling the patch now and giving it a run for its money.

Would it be worth benchmarking OTP v21.3.8.4?

Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu
https://github.com/erlang/otp/pull/2291 is an improvement for larger payloads, not so much for smaller ones.

60 KB - 13.6 Gbit/s (down from 14.4 Gbit/s)
120 KB - 12.8 Gbit/s (down from 14.4 Gbit/s)
10MB - 8.8 Gbit/s (up from 7.6 Gbit/s)
100MB - 5.8 Gbit/s (up from 4.3 Gbit/s)



Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Michael Truog
On 6/18/19 2:14 PM, Gerhard Lazu wrote:

> https://github.com/erlang/otp/pull/2291 is an improvement for larger
> payloads, not so much for smaller ones.
>
> 60 KB - 13.6 Gbit/s (down from 14.4 Gbit/s)
> 120 KB - 12.8 Gbit/s (down from 14.4 Gbit/s)
> 10MB - 8.8 Gbit/s (up from 7.6 Gbit/s)
> 100MB - 5.8 Gbit/s (up from 4.3 Gbit/s)
>
> This is OTP 22.0.2
> https://grafana.gcp.rabbitmq.com/dashboard/snapshot/SDr2EWgaD2KOX5i0H154UvzoHRS68FV0
>
> This is OTP 22.0.3 + PR 2291
> https://grafana.gcp.rabbitmq.com/dashboard/snapshot/8yxY61o140gvAKXFthgIVCes2oW2Owa7 
>
>
> Will try https://github.com/erlang/otp/pull/2293 tomorrow
>
You should also try using the erl command line option +zdbbl when testing larger payloads (it defaults to 1024 KB, with valid values from 1 to 2097151 KB).
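For completeness, the value currently in effect can be read at runtime; it is reported in bytes, so the default of 1024 KB shows up as 1048576:

1> erlang:system_info(dist_buf_busy_limit).
1048576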

Best Regards,
Michael
Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Gerhard Lazu
Our zdbbl defaults to 128000. This PR exposes how much of the zdbbl is being used: https://github.com/erlang/otp/pull/2270

We've tried larger values - 1280000 - which increase memory usage and make throughput marginally better (~1%), but spikier.



Re: Erlang distribution links don't fully utilise available resources - OTP 22.0.2 - Why?

Lukas Larsson-8
In reply to this post by Gerhard Lazu


Just keep in mind that these optimizations work great when you are sending large binary terms. If you were to send the same data as a list, you would be back where you started. So, as always when you do benchmarks, make sure that you are testing with data that is close to what you expect your application to actually send.
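One quick way to see part of that difference is to compare the external (term_to_binary) encoding of the same bytes held as a binary versus as a list; the list form is roughly twice the size on the encode side alone, and the per-element decode work on the receiving end is what really hurts:

Bin  = <<0:100000/unit:8>>,                %% 100 KB as a single binary
List = binary_to_list(Bin),                %% the same bytes as a list of integers
{byte_size(term_to_binary(Bin)),           %% roughly 100 kB
 byte_size(term_to_binary(List))}.         %% roughly 200 kB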

Lukas
