Process scheduling and punishment

Knut Nesheim
Dear list,

We have a case where a gen_server gets "slow" after it has handled
many messages, while what it does stays exactly the same. We suspect
the scheduling of the process changes. I was hoping someone on the
list could shed some light on why this happens and if there is any way
to avoid it.

When repeatedly running the same test suite, after some time we notice
random parts of the test suite getting two orders of magnitude slower.
The tests query our server over HTTP and the round-trip time goes from
~1ms to 75-100ms at a very sharp point. After this point, it stays at
the same level until the gen_server is restarted. The CPU usage of the
beam process stays around 5-10% and from etop we see no change.

What happens is basically this:
 * From short-lived processes spawned by misultin we query a single
gen_server while measuring wall-clock time between the point where we
send the message and the point where we get the reply. From this point
of view, the gen_server starts out fast: most calls take only a couple
of hundred microseconds to complete.
 * Inside the gen_server we do very little work, and measuring
wall-clock time there shows we consistently spend less than 100
microseconds.
 * Around 10 times per second, the gen_server sends a message
containing roughly 1000 words to a logging process.
 * At the point where the misultin processes start measuring the
gen_server as slow, we still consistently spend less than 100
microseconds inside the gen_server.
 * At this point, we also see messages (no more than one) in the
message queue of the process, which is weird as end to end we are
sequential, so the process has nothing to do but handle these messages.
 * At no point do we see the logging process having messages in the
queue. It is using the same amount of cpu in both states.
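
The caller-side measurement described in the list above could be sketched roughly as follows. This is only an illustration: the module name call_probe and the function latencies/3 are hypothetical, and the sketch assumes the server answers ordinary gen_server:call/2 requests. It uses timer:tc/3, which reports elapsed wall-clock time in microseconds.

```erlang
-module(call_probe).
-export([latencies/3]).

%% Hypothetical helper, not part of the system discussed in the thread.
%% Issue N synchronous gen_server calls and return the round-trip
%% times in microseconds, sorted ascending, as seen by the caller.
latencies(Server, Request, N) when is_integer(N), N >= 0 ->
    lists:sort([one_call(Server, Request) || _ <- lists:seq(1, N)]).

%% Time a single call; timer:tc/3 returns {Microseconds, Result}.
one_call(Server, Request) ->
    {Micros, _Reply} = timer:tc(gen_server, call, [Server, Request]),
    Micros.
```

Comparing the head and the tail of the sorted list before and after the slowdown shows whether every call got slower or only some of them.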

Is it the case that our gen_server is "punished" due to overloading
the logging process? Is there any way to measure if the VM considers
our logging process to be overloaded? Is there any general form of
"punishment" for very busy processes that might cause starvation for
our gen_server?

In our live system we have many of these gen_servers, but the request
rate is much lower and they do very little logging (if at all). If it
is the case that our gen_server is punished, what would happen when we
have ten thousand of them? If every server logs at some point in its
life and one server goes crazy and overloads the log process, will all
servers be punished?

Thanks
Knut
--
Engineering
http://www.wooga.com | phone +49 151 57202523 | fax +49-30-8964 9064

wooga GmbH | Saarbruecker Str. 38 | 10405 Berlin | Germany
Sitz der Gesellschaft: Berlin; HRB 117846 B
Registergericht Berlin-Charlottenburg
Geschaeftsfuehrung: Jens Begemann, Philipp Moeser
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Process scheduling and punishment

dmercer
On Friday, July 15, 2011, Knut Nesheim wrote:

Is it possibly related to the

> We have a case where a gen_server gets "slow" after it has handled
> many messages, while what it does stays exactly the same. We suspect
> the scheduling of the process changes. I was hoping someone on the
> list could shed some light on why this happens and if there is any way
> to avoid it.
. . .
>  * Around 10 times per second, from the gen_server we send a message
> containing roughly 1000 words to a logging process.
. . .

>  * At no point do we see the logging process having messages in the
> queue. It is using the same amount of cpu in both states.
>
> Is it the case that our gen_server is "punished" due to overloading
> the logging process? Is there any way to measure if the VM considers
> our logging process to be overloaded? Is there any general form of
> "punishment" for very busy processes that might cause starvation for
> our gen_server?
>
> In our live system we have many of these gen_servers, but the request
> rate is much lower and they do very little logging (if at all).

I'm guessing this is related to the cost of sending being proportional to
the receiver's message queue.  (Ref. last bullet point on
http://www.erlang.org/documentation/doc-4.9.1/erts-4.9.1/notes.html.)  You
do say that you don't see messages accumulating in the logging process's
queue, but I still have the feeling it is related to this.
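
One way to check the queue-length hypothesis is to sample the logging process's queue much more often than a periodic view would, since a short-lived backlog is easy to miss. The following is a rough sketch only: queue_sampler and max_queue_len/3 are hypothetical names, and the only VM call relied on is erlang:process_info/2 with the message_queue_len item.

```erlang
-module(queue_sampler).
-export([max_queue_len/3]).

%% Hypothetical helper. Sample Pid's message queue length Samples
%% times, IntervalMs milliseconds apart, and return the largest value
%% seen. A process that is not alive counts as queue length 0.
max_queue_len(Pid, Samples, IntervalMs)
  when is_pid(Pid), is_integer(Samples), Samples >= 0 ->
    sample(Pid, Samples, IntervalMs, 0).

sample(_Pid, 0, _IntervalMs, Max) ->
    Max;
sample(Pid, N, IntervalMs, Max) ->
    Len = case erlang:process_info(Pid, message_queue_len) of
              {message_queue_len, L} -> L;
              undefined -> 0            % process has exited
          end,
    timer:sleep(IntervalMs),
    sample(Pid, N - 1, IntervalMs, max(Len, Max)).
```

For example, queue_sampler:max_queue_len(LoggerPid, 1000, 1) would watch the logger for roughly a second, where LoggerPid stands for the pid of the logging process. Sampling is still not continuous, so a result of 0 does not prove the queue was always empty.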

Cheers,

DBM


Re: Process scheduling and punishment

Jesper Louis Andersen
In reply to this post by Knut Nesheim
On Fri, Jul 15, 2011 at 12:40, Knut Nesheim <[hidden email]> wrote:
> Dear list,
>
> We have a case where a gen_server gets "slow" after it has handled
> many messages, while what it does stays exactly the same. We suspect
> the scheduling of the process changes. I was hoping someone on the
> list could shed some light on why this happens and if there is any way
> to avoid it.

Just for the record... What version of OTP are we talking about here?

--
J.

Re: Process scheduling and punishment

Knut Nesheim
On Fri, Jul 15, 2011 at 3:35 PM, Jesper Louis Andersen
<[hidden email]> wrote:
> Just for the record... What version of OTP are we talking about here?

Sorry. The version is R14B03.

Regards
Knut

Re: Process scheduling and punishment

Knut Nesheim
In reply to this post by dmercer
On Fri, Jul 15, 2011 at 1:30 PM, David Mercer <[hidden email]> wrote:
> I'm guessing this is related to the cost of sending being proportional to
> the receiver's message queue.  (Ref. last bullet point on
> http://www.erlang.org/documentation/doc-4.9.1/erts-4.9.1/notes.html.)  You
> do say that you don't see messages accumulating in the logging process's
> queue, but I still have the feeling it is related to this.
>

Thanks. This sounds like a possible explanation. Do you know if there
is any way to measure/understand which process is slowed down?
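
One possible way to look at this is the VM's scheduling trace. The sketch below is an assumption about how it might be used here, not something confirmed in this thread: sched_probe and run_slices/2 are hypothetical names, while erlang:trace/3 with the running and timestamp flags, erlang:trace_delivered/1 and timer:now_diff/2 are standard calls. It records when a process is scheduled in and out and returns how long each running slice lasted.

```erlang
-module(sched_probe).
-export([run_slices/2]).

%% Hypothetical helper. Trace the scheduling of Pid for DurationMs
%% milliseconds and return the length, in microseconds, of each
%% completed slice during which Pid was running. The calling process
%% acts as the tracer and must not be Pid itself.
run_slices(Pid, DurationMs) when is_pid(Pid) ->
    erlang:trace(Pid, true, [running, timestamp]),
    timer:sleep(DurationMs),
    erlang:trace(Pid, false, [running, timestamp]),
    %% Wait until every pending trace message has reached our mailbox.
    Ref = erlang:trace_delivered(Pid),
    receive {trace_delivered, Pid, Ref} -> ok end,
    slices(collect(Pid, []), []).

%% Drain the trace messages for Pid from our mailbox, oldest first.
collect(Pid, Acc) ->
    receive
        {trace_ts, Pid, in, _MFA, Ts}  -> collect(Pid, [{in, Ts} | Acc]);
        {trace_ts, Pid, out, _MFA, Ts} -> collect(Pid, [{out, Ts} | Acc]);
        {trace_ts, Pid, _Other, _MFA, _Ts} -> collect(Pid, Acc)
    after 0 ->
        lists:reverse(Acc)
    end.

%% Pair each 'in' event with the 'out' event that follows it.
slices([{in, T0}, {out, T1} | Rest], Acc) ->
    slices(Rest, [timer:now_diff(T1, T0) | Acc]);
slices([_Unpaired | Rest], Acc) ->
    slices(Rest, Acc);
slices([], Acc) ->
    lists:reverse(Acc).
```

The gaps between slices include time the process spent idle in receive, so they are not by themselves a measure of time spent waiting for a scheduler. Comparing the slice lengths in the fast and slow states at least shows whether the process itself runs longer per request.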

Regards
Knut

Re: Process scheduling and punishment

Michael Truog
In reply to this post by Knut Nesheim
One way that can happen is if lots of binaries are created faster than
the garbage collector can reclaim them. Memory consumption should then
be higher and the process slower. The usual way to deal with that is to
handle all the binaries within a short-lived spawned (linked) process,
which forces (i.e., encourages) more immediate garbage collection. This
code forces garbage collection as much as possible:
https://github.com/okeuday/CloudI/blob/master/src/lib/unused/src/immediate_gc.erl
However, only testing would determine whether such an extreme is necessary.
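
A minimal sketch of the short-lived process approach, under the assumption that the binary-heavy work can be wrapped in a fun that returns a small result. The module name short_lived and the function run/1 are made up for illustration and are not taken from the linked code.

```erlang
-module(short_lived).
-export([run/1]).

%% Hypothetical helper. Run Fun in a fresh linked process and return
%% its result. Everything the worker allocated, including references
%% to large binaries, is released when the worker exits, without
%% waiting for a garbage collection of the caller's heap.
%% Because of the link, a crash in Fun also takes down the caller
%% (unless the caller traps exits, in which case this call would
%% block waiting for a reply that never comes).
run(Fun) when is_function(Fun, 0) ->
    Parent = self(),
    Ref = make_ref(),
    _Worker = spawn_link(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Result} -> Result
    end.
```

This only helps if the result sent back is small. Returning a large binary hands a reference to the caller, which then keeps the binary alive until the caller's own heap is collected.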

- Michael

On 07/15/2011 03:40 AM, Knut Nesheim wrote:

> Dear list,
>
> We have a case where a gen_server gets "slow" after it has handled
> many messages, while what it does stays exactly the same. We suspect
> the scheduling of the process changes. I was hoping someone on the
> list could shed some light on why this happens and if there is any way
> to avoid it.
