|
Hi! We have a performance problem receiving lots of UDP traffic. There are a lot (about 70) of UDP receive processes, each handling about 1 to 10 megabits of multicast traffic, with {active, N}.

msacc summary on my OSX laptop, built from OTP master c30309e799212b080c39ee2f91af3f9a0383d767 (Apr 19):

    Thread     alloc    aux    bif  busy_wait  check_io  emulator    ets     gc  gc_full    nif  other   port   send   sleep  timers
    scheduler 30.02%  0.92%  2.86%     24.66%     0.01%     9.61%  0.03%  1.25%    0.20%  0.13%  2.34%  9.33%  0.41%  17.78%   0.44%
Linux production server behaves the same way (we do not have extended msacc there yet, so most of alloc goes to port).
perf top (on Linux production) says there's a lot of unaligned memmove:

    69.76%  libc-2.24.so  [.] __memmove_sse2_unaligned_erms
     6.13%  beam.smp      [.] process_main
     2.02%  beam.smp      [.] erts_schedule
     0.87%  [kernel]      [k] copy_user_enhanced_fast_string
I'll try to make a minimal example for this. Maybe there are simple recommendations on optimizing this kind of load?
|
|
On Wed, May 23, 2018 at 06:28:55PM +0300, Danil Zagoskin wrote:
> Hi!
>
> We have a performance problem receiving lots of UDP traffic.
> There are a lot (about 70) of UDP receive processes, each handling about 1
> to 10 megabits of multicast traffic, with {active, N}.
Whenever someone has UDP receive performance problems, one has to ask whether you have seen the Erlang socket option {read_packets,N}?
See http://erlang.org/doc/man/inet.html#setopts-2
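Something like this, say (an untested sketch of mine; the port, multicast group, and option values are placeholders, not from this thread):

    %% Open a multicast receive socket; {read_packets, N} lets the driver
    %% read up to N datagrams per poll wakeup instead of just one.
    {ok, Sock} = gen_udp:open(3999, [binary,
                                     {reuseaddr, true},
                                     {active, 100},        %% {active, N} as described
                                     {read_packets, 100},  %% up to 100 reads per wakeup
                                     {recbuf, 2*1024*1024}]),
    ok = inet:setopts(Sock, [{add_membership, {{239,9,9,9}, {0,0,0,0}}}]).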
> msacc summary on my OSX laptop, built from OTP master
> c30309e799212b080c39ee2f91af3f9a0383d767 (Apr 19):
>
>
>     Thread     alloc    aux    bif  busy_wait  check_io  emulator    ets     gc  gc_full    nif  other   port   send   sleep  timers
>     scheduler 30.02%  0.92%  2.86%     24.66%     0.01%     9.61%  0.03%  1.25%    0.20%  0.13%  2.34%  9.33%  0.41%  17.78%   0.44%
>
>
> Linux production server behaves the same way (we do not have extended msacc
> there yet, so most of alloc goes to port).
>
> perf top (on Linux production) says there's a lot of unaligned memmove:
>
> 69.76% libc-2.24.so [.] __memmove_sse2_unaligned_erms
> 6.13% beam.smp [.] process_main
> 2.02% beam.smp [.] erts_schedule
> 0.87% [kernel] [k] copy_user_enhanced_fast_string
>
>
> I'll try to make a minimal example for this.
> Maybe there are simple recommendations on optimizing this kind of load?
>
> --
> Danil Zagoskin | [hidden email]
--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB
|
|
OTP-21 rc1 has enhanced IO scalability. Have you tried it to see if it is any better? UDP performance in Erlang was never great...
Regards, Sergej
Yes, we have {read_packets, 100} in receive socket options.
|
|
On Wed, May 23, 2018 at 5:29 PM Danil Zagoskin <[hidden email]> wrote:
> Hi! We have a performance problem receiving lots of UDP traffic. There are a lot (about 70) of UDP receive processes, each handling about 1 to 10 megabits of multicast traffic, with {active, N}.
Suppose you read the packets, and then throw everything away, as a test. Are you then fast enough, or do you have a problem still? Chances are that the problem isn't the reception by itself.
memmove just means you are moving memory around a lot, but the question is: why?
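For concreteness, the throwaway reader could be something like this (an untested sketch, assuming an {active, N} socket as in your setup):

    %% Receive datagrams and discard the payload; re-arm the socket
    %% when the driver switches it to passive after N packets.
    drain(Sock) ->
        receive
            {udp, Sock, _IP, _Port, _Data} ->
                drain(Sock);
            {udp_passive, Sock} ->
                ok = inet:setopts(Sock, [{active, 100}]),
                drain(Sock)
        end.

If that loop alone pegs the schedulers, the cost is in the receive path itself; if not, look at what your real readers do with the data.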
|
|
I've made a simple example. The code is at https://gist.github.com/stolen/40eebd6225faf821153f1eeb5374f068
I added the 239.0.0.0/8 route as direct loopback (lo0 in OSX) to avoid network driver overhead.
40 dummy readers are enough to eat almost all 4 cores on my macbook (quite old i7 2 GHz):
Erlang/OTP 21 [RELEASE CANDIDATE 1] [erts-9.3.1] [source-dfc935298b] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
    1> c(udptest).
    {ok,udptest}
    2> udptest:start_sender({239,9,9,9}, 3999).
    <0.82.0>
    3> [udptest:start_reader({239,9,9,9}, 3999) || _ <- lists:seq(1, 40)].
    [<0.84.0>,<0.85.0>,<0.86.0>,<0.87.0>,<0.88.0>,<0.89.0>,
     <0.90.0>,<0.91.0>,<0.92.0>,<0.93.0>,<0.94.0>,<0.95.0>,
     <0.96.0>,<0.97.0>,<0.98.0>,<0.99.0>,<0.100.0>,<0.101.0>,
     <0.102.0>,<0.103.0>,<0.104.0>,<0.105.0>,<0.106.0>,<0.107.0>,
     <0.108.0>,<0.109.0>,<0.110.0>,<0.111.0>,<0.112.0>|...]
    4> msacc:start(10000), msacc:print().
    ...
    Thread     alloc    aux    bif  busy_wait  check_io  emulator    ets     gc  gc_full    nif  other    port   send  sleep  timers
    ...
    scheduler 59.95%  0.62%  0.14%     10.65%     0.00%     1.40%  0.00%  0.39%    0.00%  0.00%  1.74%  17.43%  0.00%  7.62%   0.06%
I'll build the fresh OTP on Linux box and check perf again.
|
|
We ran into the same performance problem.
Now we are trying to offload the UDP packet receiving and accumulation into C using a NIF, and pass the complete message on to Erlang. Not sure if this will be a good alternative.
The problem is that we need these UDP packets to become one single big packet. The flow of small 1500-byte packets is killing us =(
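Before going to a NIF, one Erlang-side sketch of the same accumulation (my untested illustration; the passive-mode recv and the batch size are assumptions, not our actual code) is to collect the datagrams as an iolist and flatten once per batch, so each 1500-byte binary is only copied when the big packet is built:

    %% Collect N datagrams from a passive-mode socket into one binary.
    collect(_Sock, 0, Acc) ->
        iolist_to_binary(lists:reverse(Acc));
    collect(Sock, N, Acc) ->
        {ok, {_IP, _Port, Data}} = gen_udp:recv(Sock, 0, 5000),
        collect(Sock, N - 1, [Data | Acc]).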
|
|
    -    2.57%  0.02%  31_scheduler  beam.smp  [.] process_main
       - 2.55% process_main
          - 2.43% erts_schedule
             - 2.26% erts_port_task_execute
                - 2.25% packet_inet_input.isra.31
                   - 2.05% driver_realloc_binary
                      - 2.05% realloc_thr_pref
                           1.87% __memmove_avx_unaligned_erms

That's a 40-core Xeon E5-2640, so 2.5% on a single scheduler is kind of 100%. Also, it's Linux kernel 4.9.

On a machine with kernel 4.13 and a quad-core Xeon E31225, at half of the E5's load, we have:

    -   16.11%  0.10%  1_scheduler  beam.smp  [.] erts_schedule
       - 16.01% erts_schedule
          - 13.62% erts_port_task_execute
             - 13.11% packet_inet_input.isra.31
                - 11.37% driver_realloc_binary
                   - 11.33% realloc_thr_pref
                      - 10.50% __memcpy_avx_unaligned
                           5.06% __memcpy_avx_unaligned
                         + 1.04% page_fault
                        0.66% do_erts_alcu_realloc.constprop.31
                + 0.79% 0x108f3
                  0.55% driver_deliver_term
            1.30% sched_spin_wait

Seems like the kernel version may change things a lot; I will run more tests.
But it seems like the memory operations are unaligned, which could be not very efficient.
|
|
I looked up the unaligned stuff. There is no aligned variant, and the unaligned variant just sets up a prologue before entering the main loop, where you do have alignment. So I wouldn't worry about that, but more about where the calls are being made and where the memory is copied around.
|
|
Have you considered opening multiple sockets from multiple processes on the same port? On BSD and recent Linux kernels this will result in incoming packets being load-balanced across the sockets. It seems that your major problem was with high CPU usage, so I'm not sure that technique would help--unless you were maxing out only a single core.
It's not directly supported in Erlang, but the "raw" option lets you do it. I've got it working on OS X, though I've not yet load tested it. I haven't had the chance to verify the Linux branch, but here it is (the following is Elixir, not Erlang, but close enough):
def reuseport_option() do
  case :os.type() do
    {:unix, name} ->
      cond do
        name in [:darwin, :freebsd, :openbsd, :netbsd] ->
          # BSD/Darwin: SOL_SOCKET = 0xffff, SO_REUSEPORT = 0x0200
          [{:raw, 0xffff, 0x0200, <<1::native-integer-size(32)>>}]

        name in [:linux] ->
          # Linux: SOL_SOCKET = 1, SO_REUSEPORT = 15
          [{:raw, 1, 15, <<1::native-integer-size(32)>>}]

        # cond needs a truthy fallback; a `false ->` clause can never
        # match, and a miss would raise CondClauseError
        true ->
          []
      end

    _ ->
      []
  end
end
Just append that result to your options. (Of course, if you run on a platform which is not supported here, if you try to open second and subsequent sockets you'll get errors, so use the [] return as a "can't do this" flag if there's any chance you'll run on such a platform.)
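In Erlang rather than Elixir, the same raw option would look roughly like this (my sketch; the constants are the usual SOL_SOCKET/SO_REUSEPORT values, so verify them on your platform):

    %% SO_REUSEPORT via the raw socket option: protocol level, option
    %% number, and the value as a native 32-bit integer.
    ReusePort = case os:type() of
                    {unix, linux} ->
                        [{raw, 1, 15, <<1:32/native>>}];
                    {unix, OS} when OS =:= darwin; OS =:= freebsd;
                                    OS =:= openbsd; OS =:= netbsd ->
                        [{raw, 16#ffff, 16#0200, <<1:32/native>>}];
                    _ ->
                        []
                end,
    {ok, Sock} = gen_udp:open(3999, [binary, {reuseaddr, true} | ReusePort]).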
--
Scott Ribe
[hidden email]
https://www.linkedin.com/in/scottribe/

> On May 26, 2018, at 12:17 AM, Lukas Larsson <[hidden email]> wrote:
>
>
>
> On Sat, May 26, 2018 at 3:38 AM, Max Lapshin < [hidden email]> wrote:
> Ok, will check with reducing buffer.
>
> We have put 2 MB and even 16 MB because without it, we got packet drops
>
> You could also try raising the sbct limit to over 2 MB. i.e.
>
> %% Raise sbct limit
> +MBsbct 4096 +MBlmbcs 20480
>
> or
>
> %% lower user-space buffer
> Common = [binary,{reuseaddr,true},{buffer,256*1024},{recbuf,2*1024*1024},inet,{read_packets,100},{active,500}],
|
|
Hi, I think I was able to reproduce this problem. And to me it looks like a bug. On my system, packet_inet_input is called but the recv call returns eagain. This isn't freeing the buffer, so the realloc (intended for fragments?) is triggered, reallocating the same-size buffer again.
The below branch with the below change reduced the reallocs on my Linux system.
Jonas
|
|
Hello!

On Wed, Jun 13, 2018 at 11:08 PM Jonas Falkevik <[hidden email]> wrote:
> Hi, I think I was able to reproduce this problem. And to me it looks like a bug. On my system, packet_inet_input is called but the recv call returns eagain. This isn't freeing the buffer, so the realloc (intended for fragments?) is triggered, reallocating the same-size buffer again.
> The below branch with the below change reduced the reallocs on my Linux system.
Interesting... I wonder if maybe it wouldn't be better to solve this problem in the code for realloc, so that no copy is done when a realloc of the same size is issued... that way we solve it in all places instead of only in the inet_driver.
Lukas
|
|
> Interesting... I wonder if maybe it wouldn't be better to solve this problem in the code for realloc, so that no copy is done when a realloc of the same size is issued... that way we solve it in all places instead of only in the inet_driver.
That sounds reasonable. :)
Some extra pointer fiddling will still be done, setting the same values. But that is nothing in comparison.
Why select indicates that the socket is ready for reading when a recv would block seems to boil down to performance.
Jonas
|