300k HTTP GET RPS on Erlang 21.2, benchmarking per scheduler polling

300k HTTP GET RPS on Erlang 21.2, benchmarking per scheduler polling

Vans S
If anyone is interested, here is the write-up: https://elixirforum.com/t/300k-requests-per-second-webserver-in-elixir-otp21-2-10-cores/18823.

tl;dr: about 22% of scheduler time is spent in poll to serve ~30k HTTP GET requests. I think that's still a little much?

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: 300k HTTP GET RPS on Erlang 21.2, benchmarking per scheduler polling

Lukas Larsson

Hello,

On Tue, Dec 18, 2018 at 2:36 AM Vans S <[hidden email]> wrote:
> If anyone is interested, here is the write-up: https://elixirforum.com/t/300k-requests-per-second-webserver-in-elixir-otp21-2-10-cores/18823.
>
> tl;dr: about 22% of scheduler time is spent in poll to serve ~30k HTTP GET requests. I think that's still a little much?

Firstly, it is not poll that you spend 22% in, it is PORT, i.e. the work done by gen_tcp to call writev/read. Polling shows up in the state CHECK_IO. The optimizations introduced in 21.2 were mainly done to reduce the time spent doing polling.

Secondly, I'd say it is too little. As you saw in the edit that you made, if you remove/optimize the Erlang parts you will get a higher throughput rate, as the system can spend more time doing port work. What you are seeing as OTHER is most likely the system spinning looking for work to do. If you are interested in digging deeper, you can get more states by passing --with-microstate-accounting=extra to configure.

The inet_driver (the port driver used for TCP/UDP/SCTP) is not perfect, but almost two decades have been spent improving it, so there is very little low-hanging fruit left.

Lukas


Re: 300k HTTP GET RPS on Erlang 21.2, benchmarking per scheduler polling

Vans S
I think OTHER has something to do with ports/polling, because I just removed the inet_drv and wrote a simple C NIF to do the TCP networking, and the throughput doubled. I did not get around to recompiling Erlang with extra microstate accounting, but without the inet driver, using an unoptimized nonblocking TCP NIF, the msacc report looks like:

        Thread      aux check_io emulator       gc    other     port    sleep
 scheduler( 1)    0.68%    0.00%   89.85%    3.46%    6.01%    0.00%    0.00%
 scheduler( 2)    0.66%    0.01%   90.43%    3.40%    5.50%    0.00%    0.00%

I am using 2 schedulers because the 10 physical cores generating load are now only barely able to saturate them. Now 90% of the time is spent in the emulator and 6% in other; I am guessing the 6% other is the NIF calls into the socket syscalls?

The throughput was 250k on 2 physical cores. If it all scales linearly, that is 1.25M RPS for a simple GET hello-world benchmark.

The NIF is a PoC: https://gist.github.com/vans163/d96fcc7c89d0cf25c819c5fb77769e81. Of course it is only useful when there is constant data on the socket; this PoC will break if there are idle connections that keep getting polled. It does open the possibility, though, of using something like DPDK.


Re: 300k HTTP GET RPS on Erlang 21.2, benchmarking per scheduler polling

Fred Hebert
On 12/18, Vans S wrote:

> I think OTHER is something to do with Ports / polling, because I just
> removed the inet_drv and wrote a simple c nif to do TCP networking,
> and the throughput doubled. I did not get around to recompiling erlang
> with microstate accounting but without inet driver using an
> unoptimized nonblocking tcp nif I got the msacc report to look like
>
>Using 2 schedulers because 10 physical cores generating load now just
>barely fully saturated.  now 90% of the time is spent in emulator, 6%
>is other, I am guessing 6% other is the NIF calls to the socket calls?
>
>The throughput was 250k for 2 physical cores.  If all scales linearly
>that is 1.25m RPS for simple GET hello world benchmark.
>
>The NIF is
>PoC https://gist.github.com/vans163/d96fcc7c89d0cf25c819c5fb77769e81 ofcourse
>its only useful in the case there is constant data on socket, otherwise
>this PoC will break if there is idle connections that keep getting
>polled.  This opens the possibility though to using something like
>DPDK.

I think you might have achieved this:

https://twitter.com/seldo/status/800817973921386497

Chapter 15: 300% performance boosts by deleting data validity checks

Of course, the driver may have a higher baseline overhead than a NIF,
but you also got rid of all validation and handling of any edge case
whatsoever.

You claim your NIF is not optimized, but it is _extremely_ optimized:
you removed all code that could have been useful for scenarios that are
not the one you are actively testing, therefore getting rid of all their
overheads.

And you are doing so on a rather useless benchmark: hello worlds that
parse nothing and therefore have nothing in common with any application
in the wild that might care about the request's content. The benchmark
results you get can therefore not be extrapolated to be relevant to any
application out there.

I would likely urge you, unless you are doing this for the fun of
micro-optimizing edge cases, to consider basing your work on more
representative benchmarks.

Re: 300k HTTP GET RPS on Erlang 21.2, benchmarking per scheduler polling

Vans S

Removing unoptimized validity checks (parsing the request) gave about a 2x speedup, from 300k to ~600k.

Removing the inet_drv and replacing it with a very inefficient C NIF that polls every 10 ms from VM processes across 6000 connections, plus not doing validity checks, got a theoretical ~1.25M, as I ran out of cores to benchmark from. So call that a 2x speedup over the inet_drv. This is not optimized: an optimized C NIF for polling would use edge-triggered polling on groups of descriptors, to avoid both a kernel call every 10 ms and the C NIF entry overhead per connection.

What I removed from the inet_drv behavior were what I consider useless guarantees, guarantees that cost one extra memory copy/allocation plus extra CPU overhead. For example, inet_drv guarantees that send will ALWAYS send all the bytes you pass it, keeping a copy of the buffer. I dropped this guarantee; I don't think it is appropriate for performance. The process doing the send should just keep track of the buffer: if a send is partial, return the number of bytes that were sent and let the caller advance its own buffer.

Also, indeed, the benchmark is not representative of 99.9% of the workloads that non-CDN, non-caching web servers handle, where the work to generate the response itself far outweighs the overhead of parsing the request. But it is nevertheless interesting to see just how low the VM overhead is. (I don't count inet_drv as part of the VM when I say VM.)


Re: 300k HTTP GET RPS on Erlang 21.2, benchmarking per scheduler polling

Frank Muller
In reply to this post by Fred Hebert
Comparing apples to oranges will certainly mislead you and bias your benchmark.

We have always found Cowboy faster than Elli, Yaws, or Misultin, to name a few.

We switched to Cowboy in prod a few years ago and haven't regretted it.

The author does a great job maintaining it, keeping the API consistent and the docs up to date.

Yes, it's slower than Nginx, but who really cares? Cowboy scales well and it's pure Erlang.

Merry Christmas
/Frank

