Hi,

We are trying to understand what prevents the Erlang distribution link from saturating the network. Even though there is plenty of CPU, memory & network bandwidth, the Erlang distribution doesn't fully utilise available resources. Can you help us figure out why?

We have a 3-node Erlang 22.0.2 cluster running on Ubuntu 16.04 x86 64-bit. This is the maximum network throughput between node-a & node-b, as measured by iperf:

iperf -t 30 -c node-b
------------------------------------------------------------
Client connecting to 10.0.1.37, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.1.36 port 43576 connected with 10.0.1.37 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.0 sec 78.8 GBytes 22.6 Gbits/sec
We ran this multiple times, in different directions and with different degrees of parallelism; the maximum network throughput is roughly 22 Gbit/s.

We run the following command on node-a (100 processes, each casting a 10 MB binary to node-b in a tight loop):

B = fun F() ->
        rpc:cast('foo@node-b', erlang, is_binary, [<<0:10000000/unit:8>>]),
        F()
    end.
[spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].
This is what the network reports on node-a:

dstat -n 1 10
-net/total-
recv send
0 0
676k 756M
643k 767M
584k 679M
693k 777M
648k 745M
660k 745M
667k 772M
651k 709M
675k 782M
688k 819M
That roughly translates to 6 Gbit/s (~750 MB/s × 8 bits): the Erlang distribution link between node-a & node-b is maxing out at around 6 Gbit/s, which is only 27% of what we measure consistently and repeatedly outside of Erlang. In other words, iperf is 3.6x faster than an Erlang distribution link.

It gets better. If we start another 100 processes pumping 10 MB messages from node-a to node-c, we see the network throughput double:

dstat -n 1 10
-net/total-
recv send
0 0
1303k 1463M
1248k 1360M
1332k 1458M
1480k 1569M
1339k 1455M
1413k 1494M
1395k 1431M
1359k 1514M
1438k 1564M
1379k 1489M
So two distribution links, each to a separate node, utilise 12 Gbit/s out of the 22 Gbit/s available on node-a. What is preventing the distribution links from pushing more data through? There is plenty of CPU & memory available (all nodes have 16 CPUs & 104 GB MEM, n1-highmem-16):

dstat -cm 1 10
----total-cpu-usage---- ------memory-usage-----
usr sys idl wai hiq siq| used buff cach free
10 6 84 0 0 1|16.3G 118M 284M 85.6G
20 6 73 0 0 1|16.3G 118M 284M 85.6G
20 6 74 0 0 0|16.3G 118M 284M 85.6G
18 6 76 0 0 0|16.4G 118M 284M 85.5G
19 6 74 0 0 1|16.4G 118M 284M 85.4G
17 4 78 0 0 0|16.5G 118M 284M 85.4G
20 6 74 0 0 0|16.5G 118M 284M 85.4G
19 6 74 0 0 0|16.5G 118M 284M 85.4G
19 5 76 0 0 1|16.5G 118M 284M 85.4G
18 6 75 0 0 0|16.5G 118M 284M 85.4G
18 6 75 0 0 0|16.6G 118M 284M 85.3G
The only smoking gun is the distribution output queue buffer: https://grafana.gcp.rabbitmq.com/dashboard/snapshot/H329EfN3SFhsveA20ei7jC7JMFHAm8Ru?orgId=1&fullscreen&panelId=62

Speaking of which, we look forward to erlang/otp#2270 being merged: https://github.com/erlang/otp/pull/2270

All distribution metrics are available here: https://grafana.gcp.rabbitmq.com/dashboard/snapshot/H329EfN3SFhsveA20ei7jC7JMFHAm8Ru?orgId=1

If you want to see the state of the distribution links & dist process state (they are all green, btw), check the point-in-time metrics (they will expire 15 days from today): https://grafana.gcp.rabbitmq.com/d/d-SFCCmZz/erlang-distribution?from=1560775955127&to=1560779424482

How can we tell what is preventing the distribution link from using all available bandwidth? Are we missing a configuration flag? These are all the relevant beam.smp flags that we are using: https://github.com/erlang/otp/pull/2270#issuecomment-500953352
I believe the Erlang distribution is the wrong thing to use if you want to saturate the network. There is plenty of overhead for each message: the data gets copied, then encoded (copied again), then sent, then received (copied), then decoded (copied again) and delivered to the destination process (copied again). The receiving processes might also be slow to fetch the incoming data; they aren't running in hard real time and sometimes go to sleep.
I remember there were suggestions to use regular TCP connections instead; consider a user-mode driver (kernel calls have a cost) and a low-level NIF driver for that, with the intent of delivering the highest gigabits your hardware can do.
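If the plain-TCP route is worth exploring, a minimal sketch of that idea (module name and option choices are mine, purely illustrative) could look like this: a receiver that drains a socket as fast as it can, and a sender that pushes a fixed payload in a loop.

-module(raw_tcp_pump).
-export([listen/1, send_loop/3]).

%% Receiver: accept one connection and drain it as fast as possible.
listen(Port) ->
    {ok, L} = gen_tcp:listen(Port, [binary, {packet, 4},
                                    {active, false}, {reuseaddr, true}]),
    {ok, S} = gen_tcp:accept(L),
    drain(S).

drain(S) ->
    case gen_tcp:recv(S, 0) of
        {ok, _Payload}  -> drain(S);
        {error, closed} -> ok
    end.

%% Sender: connect and push the same payload in a loop.
send_loop(Host, Port, Payload) ->
    {ok, S} = gen_tcp:connect(Host, Port, [binary, {packet, 4}, {active, false}]),
    push(S, Payload).

push(S, Payload) ->
    ok = gen_tcp:send(S, Payload),
    push(S, Payload).

With {packet, 4} the inet driver handles framing, so each gen_tcp:recv/2 returns one complete payload.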
I wouldn't expect the Erlang distribution to reach the same network performance as iperf, but I would expect it to be within 70%-80% of maximum.
In our measurements it reaches only 27% of maximum, which makes me believe that something is misconfigured or inefficient.
The goal is to figure out which component/components are responsible for this significant network throughput loss.
Thanks for the quick response!
Consider that the VM copies the data at least 4 or 5 times, and compare that with the Gbit/s of your RAM on both servers too. Plus eventual garbage collection, which can be minimal if your VM memory footprint is small and you're only doing this perf testing.
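A rough back-of-the-envelope on top of that (my own numbers, assuming roughly five copies per message): 6 Gbit/s on the wire is about 0.75 GB/s, so five copies mean on the order of 3-4 GB/s of extra memory traffic per node, plus the CPU time spent in memcpy and in the external term encoder/decoder. That is well within the machine's memory bandwidth, but it is not free.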
Are you saying that, due to the overhead associated with sending data over the distribution, we are hitting the distribution link limit? We have measured this to be 6 Gbit/s. Can others confirm?
> Consider that the VM copies the data at least 4 or 5 times

This is not good.

I'm designing a multi-node hardware appliance for our Flussonic and thought that it would be a good idea to interconnect the nodes via Erlang distribution.

It seems it would be better to create two channels: control and data?
I've not tried using Partisan for the kind of single sender -> single receiver large blob throughput that y'all are attempting, but Partisan might make better use of high speed local networks in your case.
-Scott (standing in for Chris Meiklejohn ^_^)
I work with Gerhard (OP) and just thought I'd chime in.
Thanks for the alternative suggestions of using sockets / Partisan. At some point we will look at this, but for now we are working on making the best use possible of vanilla distributed Erlang links. We do understand there is overhead and there are inherent inefficiencies in using distributed Erlang for data transfer, but isn't it surprising / concerning to anyone else that it is only able to make use of 27% of the total network potential? Would you not prefer it to be, say, 50% or even 75%, if that is theoretically possible?
So what we'd like to find out is whether there are genuinely insurmountable reasons for this, or something that can be improved, either by tweaking config or by optimisations in the distribution layer.
We will run more tests to get more data to reason about. If anyone has suggestions for good tests to run / metrics to inspect that would be very welcome.
Cheers,
Karl
Have a look at your:
- Linux kernel TCP settings
- Memory throughput in Gbit/s (the data is copied several times for sending and receiving)
- Head-of-line blocking (if the receiver doesn't read fast enough, memory will be used to store the unconsumed data; a quick way to inspect the distribution buffers is sketched below)
- Message size: it might make some difference, as OTP 22 splits packets over 64 kB into fragments, which also have to be stored before being reassembled at the destination
- Kernel CPU time vs user CPU time (network driver and kernel overhead)
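As a rough way to check the head-of-line/buffer point from the Erlang shell on node-a (a sketch only; it assumes the peer node is connected and the default TCP distribution carrier is in use, so the controlling entity is a port):

%% Find the distribution carrier for the peer node ('foo@node-b' is the
%% node name from the original example).
{_, DistPort} = lists:keyfind('foo@node-b', 1, erlang:system_info(dist_ctrl)),

%% Socket buffer sizes and pending output on the distribution socket.
{ok, Opts}  = inet:getopts(DistPort, [sndbuf, recbuf, buffer]),
{ok, Stats} = inet:getstat(DistPort, [send_oct, send_pend, recv_oct]),

%% The limit (in bytes) at which senders get suspended because the
%% outgoing distribution buffer is considered busy.
BusyLimit = erlang:system_info(dist_buf_busy_limit),

io:format("opts=~p~nstats=~p~nbusy_limit=~p~n", [Opts, Stats, BusyLimit]).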
> B = fun F() -> rpc:cast('foo@node-b', erlang, is_binary, [<<0:10000000/unit:8>>]), F() end.
> [spawn(fun() -> B() end) || _ <- lists:seq(1, 100)].
I wrote the code below, which is better able to saturate the network. Using rpc will create bottlenecks on the receiving side, as it is the receiving process that does the decode (and with rpc:cast everything is funnelled through the single rex server process). I get about 20% more traffic through when I do that, which is not as much as I was expecting.
-module(dist_perf).
-export([go/0, go/1]).

go() ->
    go(100).

go(N) ->
    spawn_link(fun stats/0),
    RemoteNode = 'bar@elxd3291v0k',
    Pids = [spawn(RemoteNode,
                  fun F() ->
                          receive
                              {From, Msg} ->
                                  From ! is_binary(Msg),
                                  F()
                          end
                  end) || _ <- lists:seq(1, N)],
    Payload = <<0:10000000/unit:8>>,
    B = fun F(Pid) ->
                Pid ! {self(), Payload},
                receive
                    true -> F(Pid)
                end
        end,
    [spawn(fun() -> B(Pid) end) || Pid <- Pids].

stats() ->
    {{_input, T0Inp}, {_output, T0Out}} = erlang:statistics(io),
    timer:sleep(5000),
    {{_input, T1Inp}, {_output, T1Out}} = erlang:statistics(io),
    io:format("Sent ~pMB/s, Recv ~pMB/s~n",
              [(T1Out - T0Out) / 1024 / 1024 / 5,
               (T1Inp - T0Inp) / 1024 / 1024 / 5]),
    stats().
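To try this out (my notes, not part of the module): dist_perf needs to be available on both nodes, since the receiver funs are spawned remotely, and RemoteNode has to be edited to point at your own receiving node. Then, on the sending node:

c(dist_perf).
dist_perf:go(100).   %% stats/0 prints sent/received MB/s every 5 seconds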
The reason why the re-assembly of fragments (step 4 below) has to be done is that the emulator does not have a variant of erlang:binary_to_term/1 that takes an iovector as input. This could definitely be fixed and is something that we are planning on fixing, though it is not something we are working on right now. However, it will only change things in production if you send > 64 KB messages.
Also, just to be clear, the copying that is done by the distribution layer is:
1) use term_to_binary to create a buffer
2) use writev in inet_driver to copy the buffer to the kernel
3) use recv to copy data from the kernel into a refc binary
4) if needed, use memcpy to re-assemble fragments
5) use binary_to_term to decode the term
(contrary to what Dmytro said, there is no copy to the destination process)
Before OTP-22 there was an additional copy at the receiving side, but that has been removed. So imo there is only one copy that could potentially be removed.
If you want to decrease the overhead of saturating the network with many small messages I would implement something like read_packets for tcp in the inet_driver. However, the real bottleneck quickly becomes memory allocations as each message needs two or three allocations, and when you receive a lot of messages those allocations add up.
Lukas
You need to instrument. Try running something like perf(1) against the emulator and look at where it is spending its time. This will be able to either confirm or reject Dmytro's viewpoint that you are looking at excessive copying.
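Something along these lines (my own invocation; adjust the pid selection and duration to your setup):

perf record -g -p $(pgrep -n -f beam.smp) -- sleep 30
perf report --no-children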
Some great replies, thank you all.
I'm putting it all together now and re-running on my end.
Will update as I capture various observations.
TL;DR: a single distribution link was measured to peak at 14.4 Gbit/s, which is ~50% of the network maximum of 28.5 Gbit/s.
I'm confirming Lukas' observations: using smaller payload sizes can result in up to 14.4 Gbit/s throughput per distribution link.
Payloads of both 60KB & 120KB yield the same max throughput of 14.4 Gbit/s.
When message payloads go from 1MB to 4MB, throughput drops from 12.8 Gbit/s to 9.6 Gbit/s.
With 10 MB payloads, the distribution link maxes out at 7.6 Gbit/s. This is interesting because yesterday I couldn't get it above 6 Gbit/s. Since this is GCP, on paper each CPU gets 2 Gbit/s, so at 16 CPUs we should be maxing out at 32 Gbit/s. I have just benchmarked with iperf, and today these fresh VMs are peaking at 28.5 Gbit/s (yesterday they were 22.5 Gbit/s).
At 100 MB payloads, the distribution link maxes out at 4.5 Gbit/s.
They added an additional 20-25% for me.
https://github.com/erlang/otp/pull/2291 is an improvement for larger payloads, not so much for smaller ones.
60 KB  - 13.6 Gbit/s (down from 14.4 Gbit/s)
120 KB - 12.8 Gbit/s (down from 14.4 Gbit/s)
10 MB  - 8.8 Gbit/s (up from 7.6 Gbit/s)
100 MB - 5.8 Gbit/s (up from 4.3 Gbit/s)
Someone's on a roll today!
Pulling the patch now and giving it a run for its money.
Would it be worth benchmarking OTP v21.3.8.4?
You should also try using the erl command-line option +zdbbl when testing larger payloads (it defaults to 1024 KB, with valid values from 1 to 2097151 KB).
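For example (the node name and the number below are purely illustrative), starting a node with a 128 MB limit would look like:

erl -name foo@node-a +zdbbl 131072

The flag is given in KB; the current value can be read back at runtime, in bytes, with erlang:system_info(dist_buf_busy_limit).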
Best Regards,
Michael
> https://github.com/erlang/otp/pull/2291 is an improvement for larger payloads, not so much for smaller ones.
Just keep in mind that these optimizations work great for when you are sending large binary terms. If you were to send the same data but as a list, you are back where you started again. So as always when you do benchmarks, make sure that you are testing with data that is close to what you expect your application to actually send.
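A rough way to see the difference (my own numbers, using the 10 MB payload from earlier in the thread): the external term format keeps a binary essentially as-is, while a list of bytes is encoded element by element and also costs two words per element on the heap.

%% Illustrative only: compare wire size and heap size of the same payload
%% as a binary vs. as a list of bytes.
Bin  = <<0:10000000/unit:8>>,
List = binary_to_list(Bin),
WireBin  = byte_size(term_to_binary(Bin)),   %% ~10 MB: payload plus a few header bytes
WireList = byte_size(term_to_binary(List)),  %% ~20 MB: about 2 bytes per small-integer element
HeapList = erts_debug:size(List) * erlang:system_info(wordsize),
                                             %% ~160 MB of cons cells on a 64-bit VM
{WireBin, WireList, HeapList}.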
Lukas
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions