Heavy duty UDP server performance

Heavy duty UDP server performance

Ameretat Reith
I'm playing with Erlang to build an experimental protocol.  I'm trying
to make it use the full capacity of a 1 Gbit link, but it won't scale
that far, and I'm failing to find a bottleneck in my code, or even
anything I could call a bottleneck.

My software behaves very much like a messaging server, but with bigger
packets, many clients (more than 4k) and more complex sub-components,
such as a distributed database; those components do not block the rest
of the system.  It's just the client-server channel that is I/O heavy
and involves some encryption and decryption.

I made a gen_server process for each UDP socket to a client.  There is
a central process registry, but it is only called for new clients and
its message queue is usually empty.
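
Roughly, each per-socket process looks like this (simplified sketch;
module and function names here are placeholders, not my real code --
it only shows the {active, once} receive/re-arm cycle):

    -module(client_conn).
    -behaviour(gen_server).

    -export([start_link/1]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start_link(Port) ->
        gen_server:start_link(?MODULE, Port, []).

    init(Port) ->
        %% one UDP socket per client, datagrams delivered one at a time
        {ok, Socket} = gen_udp:open(Port, [binary, {active, once}]),
        {ok, #{socket => Socket}}.

    handle_info({udp, Socket, IP, InPort, Packet}, #{socket := Socket} = State) ->
        %% decrypt/handle the datagram, send the reply, then re-arm the socket
        ok = gen_udp:send(Socket, IP, InPort, handle_packet(Packet)),
        ok = inet:setopts(Socket, [{active, once}]),
        {noreply, State}.

    handle_packet(Packet) ->
        Packet.  %% placeholder for the real encryption/decryption work

    handle_call(_Req, _From, State) -> {reply, ok, State}.
    handle_cast(_Msg, State) -> {noreply, State}.
    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.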

I found a bottleneck in `scheduler_wait` when I had few clients (around
400): it consumed around 50% of total CPU usage.  I found an old patch
by Wei Cao [1] which seemed to target the same issue.  But on a modern
version of Erlang (18.0), the time spent in `scheduler_wait` drops
considerably on a more congested network, specifically to around 10%
when my software reaches its apparent limit of around 600 Mbit/s read
and written to the network.  At that point my incoming UDP packet rate
is around 24K/s.  Maybe an experienced Erlang developer here remembers
that problem and can tell me whether or not Erlang is now optimized to
poll for network packets more often.

I also looked at the async thread pool, since the VM showed fairly high
pthread activity, but found those threads are only used for file I/O
operations.  I didn't find any authoritative documentation on this; I
only saw that the sole user of this async I/O facility is `io.c` in the
OTP source code.  I would be grateful if anyone could clarify the
purpose and effect of this pool.


I made flame graphs of the function calls inside the VM (using eflame2
[2]); the profile is very even and I cannot find any outstanding
consumer [3].  I also made another flame graph from a perf report
outside the VM, which cannot resolve some symbols [4].  I'm not sure
whether process_main should take that much work by itself or not.
Apparently encryption and decryption (enacl_nif calls) don't take much
time either.

Do you have any suggestions for analyzing my software better and
understanding how the VM works?  Are these the limits I should expect,
with no more room for optimization?

Thanks in advance

1: http://erlang.org/pipermail/erlang-questions/2012-July/067868.html
2: https://github.com/slfritchie/eflame
3: http://file.reith.ir/in-erl-3k.gif
4: http://file.reith.ir/out-erl-perf.svg (interactive, use web browser)

Re: Heavy duty UDP server performance

Chandru-4
Hi,

Can you share some information about how you are setting up your UDP sockets? Buffer sizes, active mode, etc.? Are all 400 clients sending to the same UDP socket?
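
For example, a quick way to dump what a socket ends up with (here on a
freshly opened socket; the listed options are standard inet options):

    %% inspect the current configuration of a gen_udp socket
    {ok, Socket} = gen_udp:open(0, [binary]),
    {ok, Opts} = inet:getopts(Socket, [active, recbuf, sndbuf, buffer]),
    io:format("udp socket options: ~p~n", [Opts]).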

cheers,
Chandru


Re: Heavy duty UDP server performance

Sergej Jurečko
In reply to this post by Ameretat Reith
UDP performance in Erlang is not that good. If I were you, I would write a NIF. UDP is relatively simple to work with in C/C++.

Sergej


Re: Heavy duty UDP server performance

Sean Cribbs
On what basis do you make that claim? Also, writing a NIF that actually provides better performance without blocking the scheduler is non-trivial, even if UDP is simple to work with in C.

On Wed, Feb 3, 2016 at 2:52 AM, Sergej Jurečko <[hidden email]> wrote:
UDP performance in erlang is not that good. If I were you I would write a NIF. UDP is relatively simple to work with in C/C++.

Sergej


Re: Heavy duty UDP server performance

Max Lapshin-2
When we use plain gen_udp to accept 200-400 Mbit of incoming MPEG-TS traffic in Flussonic, we use about 50% of a moderate 4-core Xeon E3 server.

When we switch to our own driver implementation of UDP that collapses several contiguous UDP messages into a single big message (this is allowed for MPEG-TS), we reduce usage to 15-20%.

I can't say that it is "badly written UDP in Erlang"; it's just that messaging is rather expensive.
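
Conceptually the driver just drains already-received datagrams and
hands them over as one unit.  A plain-Erlang sketch of the same idea
(not our driver, just an illustration) would be:

    %% Drain up to Max already-queued {udp, ...} messages from the mailbox
    %% (socket in {active, true} mode) and return them as one batch.
    collect_udp(Socket, Max) ->
        collect_udp(Socket, Max, []).

    collect_udp(_Socket, 0, Acc) ->
        lists:reverse(Acc);
    collect_udp(Socket, N, Acc) ->
        receive
            {udp, Socket, IP, Port, Packet} ->
                collect_udp(Socket, N - 1, [{IP, Port, Packet} | Acc])
        after 0 ->
            lists:reverse(Acc)
        end.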


Re: Heavy duty UDP server performance

Sergej Jurečko
Yeah, same experience and the same solution here. I presume there is a lot of allocation/fragmentation going on with UDP, since packets tend to be small and there are so many of them.

Sergej


Re: Heavy duty UDP server performance

Kostis Sagonas-2

To an outsider like me, none of the above answers reply to the
original question, which read:

   On what basis do you make that claim? Also, writing a NIF that
actually provides better performance without blocking the scheduler is
non-trivial, even if UDP is simple to work with in C.

Some questions:

  1. Is your driver implementation one that does not block a scheduler?

  2. Why is % of CPU usage (rather than whether or not you can achieve
the throughput which the application requires) a useful/interesting
metric here?

Like Sean, I am not disputing that one can write a more performant UDP
server in C than in Erlang; I am just curious why writing it in Erlang
is not sufficient for many/most applications, or even advantageous for
other reasons.

Kostis


Re: Heavy duty UDP server performance

Sean Cribbs
In reply to this post by Max Lapshin-2
This difference (batching improves performance) is interesting to me because it reminds me of a problem we had where many small writes to gen_tcp dragged performance down, whereas batching them into something closer to the MTU or the TCP buffer size greatly improved throughput. Calling into the driver added ~30ms when sending tiny messages, even with Nagle off.

I'm curious, did you try gen_udp with the {active, N} option, or were they already running in {active, true} mode?
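
With {active, N} the receive loop would look roughly like this (sketch
only; handle_packet is a placeholder): once the counter runs out, the
socket sends {udp_passive, Socket} and you re-arm it.

    loop(Socket, N) ->
        receive
            {udp, Socket, IP, InPort, Packet} ->
                handle_packet(IP, InPort, Packet),   %% application-specific
                loop(Socket, N);
            {udp_passive, Socket} ->
                %% counter exhausted; grant another N datagrams
                ok = inet:setopts(Socket, [{active, N}]),
                loop(Socket, N)
        end.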


Re: Heavy duty UDP server performance

Max Lapshin-2
Kostis, the problem is that when the CPU load goes above 50% you may start losing packets.


Re: Heavy duty UDP server performance

Ameretat Reith
In reply to this post by Sean Cribbs
On Wed, 3 Feb 2016 12:24:40 -0600
Sean Cribbs <[hidden email]> wrote:

> I'm curious, did you try gen_udp with the {active, N} option, or were
> they already running in {active, true} mode?

I got the best results with {active, once}.  I also played with the
{read_packets, N} socket option, but nothing changed much.
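
For reference, the open call with the options I've been tuning looks
roughly like this (the values here are just examples, not the ones I
settled on):

    open_tuned(Port) ->
        gen_udp:open(Port, [binary,
                            {active, once},
                            {read_packets, 100},    %% max datagrams read in one go
                            {recbuf, 1024 * 1024},  %% kernel receive buffer
                            {buffer, 65536}]).      %% user-space buffer size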

Re: Heavy duty UDP server performance

Ameretat Reith
In reply to this post by Max Lapshin-2
On Wed, 3 Feb 2016 22:23:36 +0300
Max Lapshin <[hidden email]> wrote:

> Kostis, the problem is that when the CPU load goes above 50% you may
> start losing packets.

How do you measure packet loss?  Is there some kind of perf event, for
example, to find it?  I see no increase in the interface's dropped
packets counter (with ifconfig); it looks like packets are being
dropped on the Erlang side, not in the kernel buffer.

Re: Heavy duty UDP server performance

Max Lapshin-2
/proc/net/udp

I promise to extract the library for reading /proc/net/udp, /proc/stat, /proc/diskstats, /proc/meminfo and /proc/net/dev that we use to monitor the servers Flussonic runs on.
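
Until then, the relevant part is just the last column of /proc/net/udp;
a rough sketch of reading it (assuming the usual Linux layout where
"drops" is the final field of each socket line):

    %% Sum the "drops" column over all UDP sockets on this host.
    udp_drops() ->
        {ok, Bin} = file:read_file("/proc/net/udp"),
        [_Header | Lines] = string:tokens(binary_to_list(Bin), "\n"),
        lists:sum([list_to_integer(lists:last(string:tokens(L, " "))) || L <- Lines]).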


Re: Heavy duty UDP server performance

Theepan
In reply to this post by Max Lapshin-2
Max,

You don't lose packets simply because the CPU reached 50%, or even 80%.

On Thu, Feb 4, 2016 at 12:53 AM, Max Lapshin <[hidden email]> wrote:
Kostis, the problem is that when the CPU load goes above 50% you may start losing packets.


Re: Heavy duty UDP server performance

Max Lapshin-2
Well, I'm not going to argue about it, but I know that it is a serious blocker for us: when Flussonic consumes 50% of all cores just on capturing (unpacking MPEG-TS is another pain), while C code takes only 10-15% for the same task, customers complain.




Re: Heavy duty UDP server performance

Ameretat Reith
I simplified the scenario and made a stress tester for this use case: handle each
UDP socket in a gen_server and send a UDP packet every millisecond [1].
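
The sender side is essentially a timer loop like this (simplified
sketch, not the exact code from the repository):

    %% Spawn a per-socket sender that emits one PayloadSize-byte datagram
    %% every millisecond and ignores any replies.
    start_sender(PeerIP, PeerPort, PayloadSize) ->
        spawn_link(fun() ->
            {ok, Socket} = gen_udp:open(0, [binary, {active, true}]),
            erlang:send_after(1, self(), tick),
            sender_loop(Socket, PeerIP, PeerPort, binary:copy(<<0>>, PayloadSize))
        end).

    sender_loop(Socket, IP, Port, Payload) ->
        receive
            tick ->
                ok = gen_udp:send(Socket, IP, Port, Payload),
                erlang:send_after(1, self(), tick),
                sender_loop(Socket, IP, Port, Payload);
            {udp, Socket, _FromIP, _FromPort, _Reply} ->
                sender_loop(Socket, IP, Port, Payload)
        end.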

It won't reach more than 280 Mbit/s on my Core 2 Duo system, even without sending
anything to the wire.  At that point the CPU is the bottleneck.  I put the perf
report in the `out` directory of the repository [2], and it shows that the time
spent in process_main is still high.

On our production servers with Xeon E3-1230 CPUs and low latency (0.20 ms
between servers), I can fill a 1 Gbit link: send 1400-byte packets every 20 ms
from 1800 ports to 1800 ports, and measure bandwidth by the received packets.
I can transfer at 1 Gbit/s, but at that point CPU usage is above 50%.
When I overload the system I see no packet drops in /proc/net/udp, but
response time drops considerably.  I think Erlang fetches packets from the
kernel very quickly and the buffers overrun on the Erlang side; I'm not sure
how to measure that.  I disabled and enabled kernel poll; perf reports the
same amount of time spent, just in different functions.


Re: Heavy duty UDP server performance

Chandru-4
Hi,

I rewrote your client slightly and got better throughput than what you are getting. Tests were run on a 2.8 GHz Intel Core i7 running OS X.

https://github.com/cmullaparthi/udpstress

23:52:18.712 [notice] listening on udp 12000
...
23:52:18.780 [notice] listening on udp 12993
23:52:18.781 [notice] listening on udp 12994
23:52:18.781 [notice] listening on udp 12995
23:52:18.781 [notice] listening on udp 12996
23:52:18.781 [notice] listening on udp 12997
23:52:18.781 [notice] listening on udp 12998
23:52:18.781 [notice] listening on udp 12999
23:52:18.781 [notice] listening on udp 13000
23:52:28.718 [notice] 1454975548: recv_pkts: 0 recv_size: 0 sent_pkts: 0 sent_size: 0 recv: 0.000000 Mbit/s send: 0.000000 Mbit/s
23:52:38.724 [notice] 1454975558: recv_pkts: 0 recv_size: 0 sent_pkts: 0 sent_size: 0 recv: 0.000000 Mbit/s send: 0.000000 Mbit/s
23:52:48.725 [notice] 1454975568: recv_pkts: 0 recv_size: 0 sent_pkts: 0 sent_size: 0 recv: 0.000000 Mbit/s send: 0.000000 Mbit/s
23:52:58.728 [notice] 1454975578: recv_pkts: 679648 recv_size: 951507200 sent_pkts: 679648 sent_size: 3398240 recv: 761.205760 Mbit/s send: 2.718592 Mbit/s
23:53:08.729 [notice] 1454975588: recv_pkts: 652524 recv_size: 913533600 sent_pkts: 652524 sent_size: 3262620 recv: 730.826880 Mbit/s send: 2.610096 Mbit/s
23:53:18.730 [notice] 1454975598: recv_pkts: 638936 recv_size: 894510400 sent_pkts: 638936 sent_size: 3194680 recv: 715.608320 Mbit/s send: 2.555744 Mbit/s
23:53:28.733 [notice] 1454975608: recv_pkts: 618893 recv_size: 866450200 sent_pkts: 618893 sent_size: 3094465 recv: 693.160160 Mbit/s send: 2.475572 Mbit/s
23:53:38.735 [notice] 1454975618: recv_pkts: 620698 recv_size: 868977200 sent_pkts: 620698 sent_size: 3103490 recv: 695.181760 Mbit/s send: 2.482792 Mbit/s
23:53:48.736 [notice] 1454975628: recv_pkts: 610931 recv_size: 855303400 sent_pkts: 610931 sent_size: 3054655 recv: 684.242720 Mbit/s send: 2.443724 Mbit/s
23:53:58.738 [notice] 1454975638: recv_pkts: 623615 recv_size: 873061000 sent_pkts: 623615 sent_size: 3118075 recv: 698.448800 Mbit/s send: 2.494460 Mbit/s
23:54:08.739 [notice] 1454975648: recv_pkts: 629565 recv_size: 881391000 sent_pkts: 629565 sent_size: 3147825 recv: 705.112800 Mbit/s send: 2.518260 Mbit/s
23:54:18.740 [notice] 1454975658: recv_pkts: 624504 recv_size: 874305600 sent_pkts: 624504 sent_size: 3122520 recv: 699.444480 Mbit/s send: 2.498016 Mbit/s
23:54:28.741 [notice] 1454975668: recv_pkts: 625500 recv_size: 875700000 sent_pkts: 625500 sent_size: 3127500 recv: 700.560000 Mbit/s send: 2.502000 Mbit/s
23:54:38.742 [notice] 1454975678: recv_pkts: 615165 recv_size: 861231000 sent_pkts: 615165 sent_size: 3075825 recv: 688.984800 Mbit/s send: 2.460660 Mbit/s
23:54:48.743 [notice] 1454975688: recv_pkts: 620643 recv_size: 868900200 sent_pkts: 620643 sent_size: 3103215 recv: 695.120160 Mbit/s send: 2.482572 Mbit/s
23:54:58.744 [notice] 1454975698: recv_pkts: 623126 recv_size: 872376400 sent_pkts: 623126 sent_size: 3115630 recv: 697.901120 Mbit/s send: 2.492504 Mbit/s
23:55:08.746 [notice] 1454975708: recv_pkts: 630593 recv_size: 882830200 sent_pkts: 630593 sent_size: 3152965 recv: 706.264160 Mbit/s send: 2.522372 Mbit/s
23:55:18.747 [notice] 1454975718: recv_pkts: 623336 recv_size: 872670400 sent_pkts: 623336 sent_size: 3116680 recv: 698.136320 Mbit/s send: 2.493344 Mbit/s
23:55:28.749 [notice] 1454975728: recv_pkts: 611828 recv_size: 856559200 sent_pkts: 611828 sent_size: 3059140 recv: 685.247360 Mbit/s send: 2.447312 Mbit/s
23:55:38.750 [notice] 1454975738: recv_pkts: 626984 recv_size: 877777600 sent_pkts: 626984 sent_size: 3134920 recv: 702.222080 Mbit/s send: 2.507936 Mbit/s




Re: Heavy duty UDP server performance

Max Lapshin-2
Is it on localhost?


Re: Heavy duty UDP server performance

Lukas Larsson-3
In reply to this post by Max Lapshin-2
Hello,


Do you think it is the batching that makes the performance difference? In that case, do you think adding an option to gen_udp that does batching would help enough? There is already read_packets as an option, and maybe it makes sense to have an option that batches all reads done by read_packets into one Erlang message.

i.e. you get something like:

{udp_batch, Socket, IP, InPortNo, [Packets]}

or possibly

{udp_batch, Socket,  [{IP, InPortNo, Packet}]}
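
Consuming that would then be something like the following (purely
illustrative, since the option and message format above are only a
proposal, not an existing OTP API):

    handle_info({udp_batch, Socket, Batch}, State) ->
        lists:foreach(fun({IP, InPortNo, Packet}) ->
                          handle_packet(IP, InPortNo, Packet)  %% application-specific
                      end, Batch),
        ok = inet:setopts(Socket, [{active, once}]),
        {noreply, State}.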

Lukas





Re: Heavy duty UDP server performance

Ameretat Reith
In reply to this post by Chandru-4
On Tue, Feb 9, 2016 at 3:28 AM, Chandru <[hidden email]> wrote:

    Hi,

    I rewrote your client slightly and got better throughput than what you are getting. Tests were run on a 2.8 GHz Intel Core i7 running OS X.

    https://github.com/cmullaparthi/udpstress


Thanks.  I tested your approach and implemented the same thing on the server side (avoiding gen_server).  Here is what I'm getting:

VMARGS_PATH=$PWD/server.args ./bin/udpstress foreground -extra plain_server 1400 1600

06:04:31.122 [notice] 1455023071: recv_pkts: 91000 recv_size: 127400000 sent_pkts: 91000 sent_size: 455000 recv: 101.920000 Mbit/s send: 0.364000 Mbit/s
06:04:41.123 [notice] 1455023081: recv_pkts: 642473 recv_size: 899462200 sent_pkts: 642473 sent_size: 3212365 recv: 719.569760 Mbit/s send: 2.569892 Mbit/s
06:04:51.124 [notice] 1455023091: recv_pkts: 659013 recv_size: 922618200 sent_pkts: 659013 sent_size: 3295065 recv: 738.094560 Mbit/s send: 2.636052 Mbit/s
06:05:01.126 [notice] 1455023101: recv_pkts: 656831 recv_size: 919563400 sent_pkts: 656831 sent_size: 3284155 recv: 735.650720 Mbit/s send: 2.627324 Mbit/s
06:05:11.126 [notice] 1455023111: recv_pkts: 646297 recv_size: 904815800 sent_pkts: 646297 sent_size: 3231485 recv: 723.852640 Mbit/s send: 2.585188 Mbit/s
06:05:21.127 [notice] 1455023121: recv_pkts: 638607 recv_size: 894049800 sent_pkts: 638607 sent_size: 3193035 recv: 715.239840 Mbit/s send: 2.554428 Mbit/s
06:05:31.128 [notice] 1455023131: recv_pkts: 641356 recv_size: 897898400 sent_pkts: 641356 sent_size: 3206780 recv: 718.318720 Mbit/s send: 2.565424 Mbit/s

$ VMARGS_PATH=$PWD/server.args ./bin/udpstress foreground -extra genserver_server 1400 1600

06:06:41.262 [notice] 1455023201: recv_pkts: 238786 recv_size: 334300400 sent_pkts: 238786 sent_size: 1193930 recv: 267.440320 Mbit/s send: 0.955144 Mbit/s
06:06:51.262 [notice] 1455023211: recv_pkts: 646220 recv_size: 904708000 sent_pkts: 646220 sent_size: 3231100 recv: 723.766400 Mbit/s send: 2.584880 Mbit/s
06:07:01.263 [notice] 1455023221: recv_pkts: 647552 recv_size: 906572800 sent_pkts: 647552 sent_size: 3237760 recv: 725.258240 Mbit/s send: 2.590208 Mbit/s
06:07:11.264 [notice] 1455023231: recv_pkts: 642863 recv_size: 900008200 sent_pkts: 642863 sent_size: 3214315 recv: 720.006560 Mbit/s send: 2.571452 Mbit/s
06:07:21.265 [notice] 1455023241: recv_pkts: 644790 recv_size: 902706000 sent_pkts: 644790 sent_size: 3223950 recv: 722.164800 Mbit/s send: 2.579160 Mbit/s


It seems both servers can keep up with that request rate, so now I try to flood the server: one request every 20 milliseconds, and watch the receive rate on the server:

$ VMARGS_PATH=$PWD/client.args ./bin/udpstress foreground -extra plain_client server_addr -i 20 1400 3100

$ VMARGS_PATH=$PWD/server.args ./bin/udpstress foreground -extra plain_server 1400 3100

07:49:01.804 [notice] 1455029341: recv_pkts: 850948 recv_size: 1191327200 sent_pkts: 850948 sent_size: 4254740 recv: 953.061760 Mbit/s send: 3.403792 Mbit/s
07:49:11.805 [notice] 1455029351: recv_pkts: 851744 recv_size: 1192441600 sent_pkts: 851744 sent_size: 4258720 recv: 953.953280 Mbit/s send: 3.406976 Mbit/s

while with genserver_client fewer packets get through; server log:

07:53:41.832 [notice] 1455029621: recv_pkts: 810797 recv_size: 1135115800 sent_pkts: 810797 sent_size: 4053985 recv: 908.092640 Mbit/s send: 3.243188 Mbit/s
07:53:51.833 [notice] 1455029631: recv_pkts: 810403 recv_size: 1134564200 sent_pkts: 810403 sent_size: 4052015 recv: 907.651360 Mbit/s send: 3.241612 Mbit/s


The above logs were made using two quad-core Xeon E3 servers with about 0.20 ms latency between them, using the code I collected in [1].  I think the problem is not whether or not gen_server is used; it's mostly that just receiving UDP packets at 1 Gbit/s takes that much CPU in Erlang.


Re: Heavy duty UDP server performance

Jesper Louis Andersen-2

On Tue, Feb 9, 2016 at 4:10 PM, Ameretat Reith <[hidden email]> wrote:
I think the problem is not whether or not gen_server is used; it's mostly that just receiving UDP packets at 1 Gbit/s takes that much CPU in Erlang.

Where is that time spent: in the Erlang VM or in the kernel? You are potentially on a wakeup schedule of 81 wakeups per millisecond to handle packets, which suggests you need to understand where your CPU time is spent in the system in order to tune it for lower CPU usage.

Many systems can show lower CPU load until you are forced to actually do things with the packets, at which point they have to pay with CPU load in order to organize the data. Chances are such organization has already been done for you by the VM.
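
A first step is to check how busy the schedulers actually are, as
opposed to what the OS reports; for example, the standard
scheduler_wall_time recipe, sketched:

    %% Fraction of scheduler time spent doing actual work during IntervalMs.
    scheduler_utilization(IntervalMs) ->
        erlang:system_flag(scheduler_wall_time, true),
        S0 = lists:sort(erlang:statistics(scheduler_wall_time)),
        timer:sleep(IntervalMs),
        S1 = lists:sort(erlang:statistics(scheduler_wall_time)),
        {Active, Total} =
            lists:foldl(fun({{_, A0, T0}, {_, A1, T1}}, {A, T}) ->
                            {A + (A1 - A0), T + (T1 - T0)}
                        end, {0, 0}, lists:zip(S0, S1)),
        Active / Total.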



--
J.
