UDP receive performance

UDP receive performance

Danil Zagoskin-2
Hi!

We have a performance problem receiving lots of UDP traffic.
There are about 70 UDP receive processes, each handling 1 to 10 megabits of multicast traffic, with {active, N}.

msacc summary on my OS X laptop, built from OTP master c30309e799212b080c39ee2f91af3f9a0383d767 (Apr 19):


        Thread    alloc      aux      bif  busy_wait  check_io  emulator    ets       gc  gc_full      nif    other     port     send    sleep   timers
     scheduler   30.02%    0.92%    2.86%     24.66%     0.01%     9.61%  0.03%    1.25%    0.20%    0.13%    2.34%    9.33%    0.41%   17.78%    0.44%

The Linux production server behaves the same way (we do not have extended msacc there yet, so most of the alloc time is attributed to port).

perf top (on Linux production) says there's a lot of unaligned memmove:
  69.76%  libc-2.24.so        [.] __memmove_sse2_unaligned_erms
   6.13%  beam.smp            [.] process_main
   2.02%  beam.smp            [.] erts_schedule
   0.87%  [kernel]            [k] copy_user_enhanced_fast_string


I'll try to put together a minimal example of this.
Maybe there are simple recommendations for optimizing this kind of load?

--
Danil Zagoskin | [hidden email]

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Re: UDP receive performance

Raimo Niskanen-2
On Wed, May 23, 2018 at 06:28:55PM +0300, Danil Zagoskin wrote:
> Hi!
>
> We have a performance problem receiving lots of UDP traffic.
> There are a lot (about 70) of UDP receive processes, each handling about 1
> to 10 megabits of multicast traffic, with {active, N}.

Whenever someone has UDP receive performance problems, one has to ask: have you
looked at the Erlang socket option {read_packets, N}?

See http://erlang.org/doc/man/inet.html#setopts-2
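For reference, a minimal sketch of a multicast receiver using this option together with the thread's {active, N} setup (the address, port, and counts are illustrative, not from the thread):

```erlang
%% Sketch: open a multicast UDP receiver with read_packets tuned.
%% {read_packets, N} reads up to N datagrams per socket-ready event.
{ok, Sock} = gen_udp:open(3999, [
    binary,
    {reuseaddr, true},
    {active, 500},          % deliver up to 500 messages, then go passive
    {read_packets, 100},
    {add_membership, {{239,9,9,9}, {0,0,0,0}}}
]),
%% Re-arm flow control when the socket goes passive:
receive
    {udp_passive, Sock} -> inet:setopts(Sock, [{active, 500}])
end.
```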


--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
Re: UDP receive performance

Danil Zagoskin-2
Yes, we have {read_packets, 100} in the receive socket options.

Re: UDP receive performance

Sergej Jurečko
OTP 21 rc1 has enhanced I/O scalability. Have you tried it to see whether it is any better? UDP performance in Erlang has never been great...

Regards,
Sergej

Re: UDP receive performance

Danil Zagoskin-2
Yes, I've built a fresh master today (Erlang/OTP 21 [RELEASE CANDIDATE 1] [erts-9.3.1]), and nothing has changed.

Re: UDP receive performance

Lukas Larsson-8
Can you run perf with "--call-graph dwarf" and see which functions are calling memmove?

Re: UDP receive performance

Jesper Louis Andersen-2
In reply to this post by Danil Zagoskin-2
On Wed, May 23, 2018 at 5:29 PM Danil Zagoskin <[hidden email]> wrote:
Hi!

We have a performance problem receiving lots of UDP traffic.
There are a lot (about 70) of UDP receive processes, each handling about 1 to 10 megabits of multicast traffic, with {active, N}.



Suppose you read the packets and then throw everything away, as a test. Are you fast enough then, or do you still have a problem? Chances are that the problem isn't the reception itself.

memmove just means you are moving a lot of memory around, but the question is: why?
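A discard-only reader for such a test might look like this (a sketch under the thread's {active, N} setup; the function name and the count of 500 are illustrative):

```erlang
%% Sketch: receive and immediately drop datagrams to isolate reception cost.
discard_loop(Sock) ->
    receive
        {udp, Sock, _Ip, _Port, _Packet} ->
            discard_loop(Sock);                       % throw the payload away
        {udp_passive, Sock} ->
            ok = inet:setopts(Sock, [{active, 500}]), % re-arm {active, N}
            discard_loop(Sock)
    end.
```

If this loop alone already saturates the CPU, the cost is in reception itself rather than in the application's processing.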

Re: UDP receive performance

Danil Zagoskin-2
I've made a simple example.

The code is at https://gist.github.com/stolen/40eebd6225faf821153f1eeb5374f068

I added a 239.0.0.0/8 route via the loopback interface (lo0 on OS X) to avoid network driver overhead.

40 dummy readers are enough to eat almost all 4 cores on my MacBook (a quite old 2 GHz i7):

Erlang/OTP 21 [RELEASE CANDIDATE 1] [erts-9.3.1] [source-dfc935298b] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]

1> c(udptest).
{ok,udptest}
2> udptest:start_sender({239,9,9,9}, 3999).
<0.82.0>
3> [udptest:start_reader({239,9,9,9}, 3999) || _ <- lists:seq(1, 40)].
[<0.84.0>,<0.85.0>,<0.86.0>,<0.87.0>,<0.88.0>,<0.89.0>,
 <0.90.0>,<0.91.0>,<0.92.0>,<0.93.0>,<0.94.0>,<0.95.0>,
 <0.96.0>,<0.97.0>,<0.98.0>,<0.99.0>,<0.100.0>,<0.101.0>,
 <0.102.0>,<0.103.0>,<0.104.0>,<0.105.0>,<0.106.0>,<0.107.0>,
 <0.108.0>,<0.109.0>,<0.110.0>,<0.111.0>,<0.112.0>|...]
4> msacc:start(10000), msacc:print().
...
        Thread    alloc      aux      bif  busy_wait  check_io  emulator     ets       gc  gc_full      nif    other     port     send    sleep   timers
...
     scheduler   59.95%    0.62%    0.14%   10.65%    0.00%    1.40%    0.00%    0.39%    0.00%    0.00%    1.74%   17.43%    0.00%    7.62%    0.06%


I'll build a fresh OTP on the Linux box and check perf again.

Re: UDP receive performance

Max Lapshin-2
The problem is that we need these UDP packets to become one single big packet. The flow of small 1500-byte packets is killing us =(
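One common way to glue small packets into one big binary without repeated copying is to accumulate them as an iolist and flatten once at the end. A sketch (the 1 MB threshold is illustrative, and deliver/1 is a hypothetical placeholder for whatever consumes the result):

```erlang
%% Sketch: accumulate datagrams as an iolist; build one big binary only
%% when enough data has arrived. Avoids repeatedly copying a growing binary.
collect(Sock, Acc, Size) when Size >= 1024*1024 ->
    Big = iolist_to_binary(lists:reverse(Acc)),  % single copy at the end
    deliver(Big),                                % hypothetical consumer
    collect(Sock, [], 0);
collect(Sock, Acc, Size) ->
    receive
        {udp, Sock, _Ip, _Port, Packet} ->
            collect(Sock, [Packet | Acc], Size + byte_size(Packet));
        {udp_passive, Sock} ->
            ok = inet:setopts(Sock, [{active, 500}]),
            collect(Sock, Acc, Size)
    end.
```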

Re: UDP receive performance

Chaitanya Chalasani-4
We ran into the same performance problem. 

Now we are trying to offload the UDP packet receiving and accumulation to C using a NIF, and pass the complete message on to Erlang. We are not sure whether this will be a good alternative.

Re: UDP receive performance

Danil Zagoskin-2
In reply to this post by Lukas Larsson-8
-    2.57%     0.02%  31_scheduler     beam.smp                            [.] process_main  
   - 2.55% process_main
      - 2.43% erts_schedule 
         - 2.26% erts_port_task_execute
            - 2.25% packet_inet_input.isra.31
               - 2.05% driver_realloc_binary
                  - 2.05% realloc_thr_pref
                       1.87% __memmove_avx_unaligned_erms

That's a 40-core Xeon E5-2640, so 2.5% on a single scheduler is effectively that scheduler at 100%.
Also, it's Linux kernel 4.9.


On a machine with kernel 4.13 and a quad-core Xeon E31225, at half of the E5's load, we have:
-   16.11%     0.10%  1_scheduler      beam.smp                       [.] erts_schedule 
   - 16.01% erts_schedule
      - 13.62% erts_port_task_execute 
         - 13.11% packet_inet_input.isra.31
            - 11.37% driver_realloc_binary
               - 11.33% realloc_thr_pref 
                  - 10.50% __memcpy_avx_unaligned 
                       5.06% __memcpy_avx_unaligned 
                     + 1.04% page_fault
                    0.66% do_erts_alcu_realloc.constprop.31
            + 0.79% 0x108f3
              0.55% driver_deliver_term 
        1.30% sched_spin_wait

It seems like the kernel version may change things a lot; I will run more tests.

But it seems like the memory operations are unaligned, which could be inefficient.

Re: UDP receive performance

Jesper Louis Andersen-2
I looked up the unaligned stuff. There is no aligned variant; the unaligned variant just sets up a prologue before entering the main loop, where you do have alignment. So I wouldn't worry about that, but rather about where the calls are being made and where the memory is copied around.

Re: UDP receive performance

Lukas Larsson-8
In reply to this post by Danil Zagoskin-2


I'm not able to reproduce your benchmark; for some reason I don't get the load that you get.

Anyway, I stared at the code a bit, and you get a lot of reallocs that move data, which is not good at all. Something that caught my eye is that recbuf is 2 MB, while the packets you receive are a lot smaller. One side effect of setting a large recbuf is that the user-space buffer is increased to the same value. I don't think you want that to happen in your case. What happens if you set buffer to the MTU?

Why would changing the user-space buffer size affect performance? Well, the UDP read is done into a user-space buffer of the given size. When that size is 2 MB, erts_alloc places the buffer inside an SBC (single-block carrier). Later, when it is known how much data was actually received, the 2 MB buffer is realloc'ed down to the size of the received data. This moves the data across the SBC boundary into an MBC (multi-block carrier), and the realloc has to copy the data in the binary. So by lowering the user-space buffer to be small enough to be placed in an MBC from the start, the move in realloc should disappear. That is, if my theory is correct.
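In socket-option terms, the suggestion amounts to keeping the kernel receive buffer large while shrinking the user-space buffer. A sketch (1500 is just the Ethernet MTU; note that buffer is listed after recbuf, since setting recbuf also adjusts buffer):

```erlang
%% Sketch: large kernel-side recbuf to absorb bursts, small user-space
%% buffer so each read fits in a multi-block carrier from the start.
ok = inet:setopts(Sock, [{recbuf, 2*1024*1024},   % kernel socket buffer
                         {buffer, 1500}]).        % user-space read buffer
```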

Lukas

Re: UDP receive performance

Max Lapshin-2
OK, will check with a reduced buffer.

We have set 2 MB and even 16 MB because without it we got packet drops.

Re: UDP receive performance

Lukas Larsson-8


On Sat, May 26, 2018 at 3:38 AM, Max Lapshin <[hidden email]> wrote:
Ok, will check with reducing buffer.

We have put 2 MB and even 16 MB because without it, we got packet drops

You could also try raising the sbct limit to over 2 MB, i.e.:

%% Raise sbct limit
+MBsbct 4096 +MBlmbcs 20480

or

%% lower user-space buffer
Common = [binary,{reuseaddr,true},{buffer,256*1024},{recbuf,2*1024*1024},inet,{read_packets,100},{active,500}],

 

Re: UDP receive performance

scott ribe
Have you considered opening multiple sockets from multiple processes on the same port? On BSD and recent Linux kernels this results in incoming packets being load-balanced across the sockets. That said, it seems your major problem was high CPU usage, so I'm not sure the technique would help unless you were maxing out only a single core.

It's not directly supported in Erlang, but the "raw" socket option lets you do it. I've got it working on OS X, though I've not yet load-tested it, and I haven't had a chance to verify the Linux branch, but here it is (the following is Elixir, not Erlang, but close enough):

  def reuseport_option() do
    case :os.type() do
      {:unix, name} ->
        cond do
          name in [:darwin, :freebsd, :openbsd, :netbsd] ->
            # BSD/Darwin: SOL_SOCKET = 0xffff, SO_REUSEPORT = 0x0200
            [{:raw, 0xffff, 0x0200, <<1::native-integer-size(32)>>}]
          name in [:linux] ->
            # Linux: SOL_SOCKET = 1, SO_REUSEPORT = 15
            [{:raw, 1, 15, <<1::native-integer-size(32)>>}]
          true ->
            # catch-all: no known SO_REUSEPORT constants for this unix
            []
        end
      _ ->
        []
    end
  end

Just append that result to your options. (Of course, on a platform which is not handled here, opening the second and subsequent sockets will fail, so use the [] return as a "can't do this" flag if there's any chance you'll run on such a platform.)
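For a pure-Erlang setup, the same trick can be sketched directly with the raw option. The constants below are the Linux values (SOL_SOCKET = 1, SO_REUSEPORT = 15); the port and the other options are only illustrative:

```erlang
%% Open two sockets on the same UDP port, possibly from different
%% processes; the kernel then load-balances incoming datagrams between
%% them. Requires Linux >= 3.9 (BSDs use their own constants, as above).
ReusePort = {raw, 1, 15, <<1:32/native>>},
Common    = [binary, {reuseaddr, true}, {active, 500}, ReusePort],
{ok, S1}  = gen_udp:open(5004, Common),
{ok, S2}  = gen_udp:open(5004, Common).
```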

--
Scott Ribe
[hidden email]
https://www.linkedin.com/in/scottribe/



> On May 26, 2018, at 12:17 AM, Lukas Larsson <[hidden email]> wrote:


Re: UDP receive performance

Max Lapshin-2
It is video; we need to receive packets strictly in order.

Yes, I know that UDP can reorder and lose packets, but under normal conditions, with a hardware headend and plain C Linux software on the receiver, it is possible to achieve almost 100% packet delivery without reordering.

We want to achieve something like this with Erlang.


Re: UDP receive performance

Jonas Falkevik
Hi,
I think I was able to reproduce this problem, and to me it looks like a bug. On my system, packet_inet_input is called but the recv call returns EAGAIN. The buffer is not freed in that case, so the realloc (intended for fragments?) is triggered, reallocating a buffer of the same size again.

The branch below, with the change shown, reduced the reallocs on my Linux system.


Jonas

On 31 May 2018, at 21:05, Max Lapshin <[hidden email]> wrote:


Re: UDP receive performance

Lukas Larsson-8
Hello!

On Wed, Jun 13, 2018 at 11:08 PM Jonas Falkevik <[hidden email]> wrote:

Interesting... I wonder if it wouldn't be better to solve this problem in the realloc code, so that no copy is done when a realloc of the same size is issued. That way we would solve it in all places instead of only in the inet driver.

Lukas


Re: UDP receive performance

Jonas Falkevik
>
>
> Interesting... I wonder if maybe it wouldn't be better to solve this problem in the code for realloc so that no copy is done when a realloc of the same size is issued... that way we solve it in all places instead of only in the inet_driver
>

That sounds reasonable. :)
Some extra pointer fiddling will still be done, setting the same values. But that is nothing in comparison.

Why select indicates that the socket is ready for reading even though the recv would block seems to boil down to performance: the kernel can, for example, drop a datagram with a bad checksum after having signalled readiness.

Jonas
