Debugging scheduler not responding to erts_schedule_misc_aux_work

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Debugging scheduler not responding to erts_schedule_misc_aux_work

Maxim Fedorov

I’m trying to debug some weird condition when any misc system task hangs.

It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.

 

It is a rare problem happening after 5-7 days under some load (~40% cpu average on a 48 cores server).

 

There is also a problem with erlang:statistics(runtime), affected by this bug in FreeBSD kernel: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227689 (so statistics:runtime() always returns the same value), however I doubt it can affect anything.

 

What happens:  there are several calls, e.g. erlang:statistics(garbage_collection), ets:all(), erts_internal:system_check() and few more. All of them do erts_schedule_misc_aux_work. A misc aux work item is put into every scheduler queue, and it seems that all of them except one respond. VM is still working, all other processes are fine, but the one that did the call is waiting in erlang:gc_info/2 (or another corresponding function), with counter equals to 1. Since there is no timeout in receive statement, it waits forever.

 

How do I debug this? Is there any way to find a scheduler that misbehaves? It is one of the normal schedulers. I’m using gdb to attach to BEAM VM.

 

Unfortunately, I cannot run debug VM (it is not able to handle the load).


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Debugging scheduler not responding to erts_schedule_misc_aux_work

Anthony Molinaro-4
It may or may not apply but we had a similar problem with system level
work being scheduled on freed processes, and have been debugging it with
Ericsson via

https://bugs.erlang.org/browse/ERL-573

for the last few months.  There's a branch of 20 in one of the later
comments which might help, and might be worth a try.

HTH,

-Anthony


> I’m trying to debug some weird condition when any misc system task hangs.
> It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.
>
> It is a rare problem happening after 5-7 days under some load (~40% cpu
> average on a 48 cores server).
>
> There is also a problem with erlang:statistics(runtime), affected by this
> bug in FreeBSD kernel:
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227689 (so
> statistics:runtime() always returns the same value), however I doubt it
> can affect anything.
>
> What happens:  there are several calls, e.g.
> erlang:statistics(garbage_collection), ets:all(),
> erts_internal:system_check() and few more. All of them do
> erts_schedule_misc_aux_work. A misc aux work item is put into every
> scheduler queue, and it seems that all of them except one respond. VM is
> still working, all other processes are fine, but the one that did the call
> is waiting in erlang:gc_info/2 (or another corresponding function), with
> counter equals to 1. Since there is no timeout in receive statement, it
> waits forever.
>
> How do I debug this? Is there any way to find a scheduler that misbehaves?
> It is one of the normal schedulers. I’m using gdb to attach to BEAM VM.
>
> Unfortunately, I cannot run debug VM (it is not able to handle the load).
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions
>


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Debugging scheduler not responding to erts_schedule_misc_aux_work

Anthony Molinaro-4
And it looks like this got merged into 20.3.8.2 released today, so you can try that tag to see if it fixes your problem.

HTH,

-Anthony

> On Jun 28, 2018, at 6:13 PM, [hidden email] wrote:
>
> It may or may not apply but we had a similar problem with system level
> work being scheduled on freed processes, and have been debugging it with
> Ericsson via
>
> https://bugs.erlang.org/browse/ERL-573
>
> for the last few months.  There's a branch of 20 in one of the later
> comments which might help, and might be worth a try.
>
> HTH,
>
> -Anthony
>
>
>> I’m trying to debug some weird condition when any misc system task hangs.
>> It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.
>>
>> It is a rare problem happening after 5-7 days under some load (~40% cpu
>> average on a 48 cores server).
>>
>> There is also a problem with erlang:statistics(runtime), affected by this
>> bug in FreeBSD kernel:
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227689 (so
>> statistics:runtime() always returns the same value), however I doubt it
>> can affect anything.
>>
>> What happens:  there are several calls, e.g.
>> erlang:statistics(garbage_collection), ets:all(),
>> erts_internal:system_check() and few more. All of them do
>> erts_schedule_misc_aux_work. A misc aux work item is put into every
>> scheduler queue, and it seems that all of them except one respond. VM is
>> still working, all other processes are fine, but the one that did the call
>> is waiting in erlang:gc_info/2 (or another corresponding function), with
>> counter equals to 1. Since there is no timeout in receive statement, it
>> waits forever.
>>
>> How do I debug this? Is there any way to find a scheduler that misbehaves?
>> It is one of the normal schedulers. I’m using gdb to attach to BEAM VM.
>>
>> Unfortunately, I cannot run debug VM (it is not able to handle the load).
>> _______________________________________________
>> erlang-questions mailing list
>> [hidden email]
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>
>
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Debugging scheduler not responding to erts_schedule_misc_aux_work

Dave Cottlehuber-5
In reply to this post by Maxim Fedorov
On Thu, 28 Jun 2018, at 22:35, Maxim Fedorov wrote:
> I’m trying to debug some weird condition when any misc system task hangs.
> It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.
>
> It is a rare problem happening after 5-7 days under some load (~40% cpu
> average on a 48 cores server).
>
> There is also a problem with erlang:statistics(runtime), affected by
> this bug in FreeBSD kernel:
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=227689 (so
> statistics:runtime() always returns the same value), however I doubt it
> can affect anything.
>
> What happens:  there are several calls, e.g.
> erlang:statistics(garbage_collection), ets:all(),
> erts_internal:system_check() and few more. All of them do
> erts_schedule_misc_aux_work. A misc aux work item is put into every
> scheduler queue, and it seems that all of them except one respond. VM is
> still working, all other processes are fine, but the one that did the
> call is waiting in erlang:gc_info/2 (or another corresponding function),
> with counter equals to 1. Since there is no timeout in receive
> statement, it waits forever.
>
> How do I debug this?

I get that your load in production is high, but will a targeted dtrace probe be lightweight enough?

Also wrt Anthony’s comments are you building with clang or gcc? I’m not clear if that’s relevant

A+
Dave








_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Debugging scheduler not responding to erts_schedule_misc_aux_work

Maxim Fedorov
In reply to this post by Anthony Molinaro-4
Thanks, Anthony.

This patch (and Erlang 20.3.8, ERTS 9.3.3.1) does not fix the problem. Same issue has been observed (2 servers out of 384 stopped responding to aux work). I believe it's a different issue.

On 6/28/18, 18:14, "[hidden email]" <[hidden email]> wrote:

    It may or may not apply but we had a similar problem with system level
    work being scheduled on freed processes, and have been debugging it with
    Ericsson via
   
    https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.erlang.org_browse_ERL-2D573&d=DwIDaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=0rT1IQzTc-5vHQf8ht6-x-ib_QvAEDJvef2Q98CuKgI&m=To7mEUnu73Ram_9CzpEXWHPvoyDBDz5HDUpb2LyN8p0&s=-TFczFM6WlUROTl7JqcN8u6DmWN3djIS9xhIBlNWgJc&e=
   
    for the last few months.  There's a branch of 20 in one of the later
    comments which might help, and might be worth a try.
   
    HTH,
   
    -Anthony
   
   
    > I’m trying to debug some weird condition when any misc system task hangs.
    > It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.
    >
    > It is a rare problem happening after 5-7 days under some load (~40% cpu
    > average on a 48 cores server).
    >
    > There is also a problem with erlang:statistics(runtime), affected by this
    > bug in FreeBSD kernel:
    > https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.freebsd.org_bugzilla_show-5Fbug.cgi-3Fid-3D227689&d=DwIDaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=0rT1IQzTc-5vHQf8ht6-x-ib_QvAEDJvef2Q98CuKgI&m=To7mEUnu73Ram_9CzpEXWHPvoyDBDz5HDUpb2LyN8p0&s=9ADTx_DQfyQVR8StEXx8AlI49seSh0G6lW-yqem4sIg&e= (so
    > statistics:runtime() always returns the same value), however I doubt it
    > can affect anything.
    >
    > What happens:  there are several calls, e.g.
    > erlang:statistics(garbage_collection), ets:all(),
    > erts_internal:system_check() and few more. All of them do
    > erts_schedule_misc_aux_work. A misc aux work item is put into every
    > scheduler queue, and it seems that all of them except one respond. VM is
    > still working, all other processes are fine, but the one that did the call
    > is waiting in erlang:gc_info/2 (or another corresponding function), with
    > counter equals to 1. Since there is no timeout in receive statement, it
    > waits forever.
    >
    > How do I debug this? Is there any way to find a scheduler that misbehaves?
    > It is one of the normal schedulers. I’m using gdb to attach to BEAM VM.
    >
    > Unfortunately, I cannot run debug VM (it is not able to handle the load).
    > _______________________________________________
    > erlang-questions mailing list
    > [hidden email]
    > https://urldefense.proofpoint.com/v2/url?u=http-3A__erlang.org_mailman_listinfo_erlang-2Dquestions&d=DwIDaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=0rT1IQzTc-5vHQf8ht6-x-ib_QvAEDJvef2Q98CuKgI&m=To7mEUnu73Ram_9CzpEXWHPvoyDBDz5HDUpb2LyN8p0&s=v-L1XxeW7SHFxLJvBtf9oVk7FY7XgrzBtT7WvHaYEmw&e=
    >
   
   
   

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Debugging scheduler not responding to erts_schedule_misc_aux_work

Maxim Fedorov
Confirming it's a different issue.

I'm getting one scheduler stuck (waiting infinitely):

Thread 80 (Thread 802010800 (LWP 100332/beam.smp)):
#0  0x000000080136e13c in pthread_cleanup_pop () from /lib/libthr.so.3
#1  0x00000000005f664e in wait__ (e=0x84a801520, spincount=<value optimized out>, timeout=<value optimized out>) at pthread/ethr_event.c:468
#2  0x00000000004db9b1 in erts_schedule (esdp=<value optimized out>, p=<value optimized out>, calls=<value optimized out>) at erl_threads.h:3406
#3  0x0000000000433cdc in process_main (x_reg_array=0x84d04edc0, f_reg_array=0x84d050e40) at beam/beam_emu.c:1349
#4  0x00000000004d4c4b in sched_thread_func (vesdp=0x84b413fc0) at beam/erl_process.c:8912
#5  0x00000000005f5fac in thr_wrapper (vtwd=0x7fffffffdb08) at pthread/ethread.c:118
#6  0x0000000801362855 in pthread_create () from /lib/libthr.so.3
#7  0x0000000000000000 in ?? ()

This happens when there are two outstanding requests:

erlang:statistics(garbage_collection)
erlang:statistics(scheduler_wall_time)

Both of these are stuck waiting for a response from scheduler #28.

--- Scheduler 28 ---
 IX: 27
 CPU Binding: 27
 Aux work Flags: delayed-dealloc-thr-prgr fix-alloc-dealloc fix-alloc-lower-lim async-ready GARBAGE
 Sleep Info Flags:
 Pointer: (ErtsSchedulerData *) 0x84b413fc0
 - Run Queue -
  Length: total=1, max=0, high=0, normal=1, low=0, port=0
  Misc Jobs: no
  Queue Mask: normal
  Misc Flags: out-of-work non-empty
  Pointer: (ErtsRunQueue *) 0x84aa4c280

Originally there was nothing in this scheduler queue, but I forcefully bound a process to it to see if scheduler is stuck on run queue lock (with process_flag(scheduler, 28)).

So I guess it's a rare bug I eventually hit, which is different from a message being sent to exiting process.

On 7/11/18, 09:30, "[hidden email] on behalf of Maxim Fedorov" <[hidden email] on behalf of [hidden email]> wrote:

    Thanks, Anthony.
   
    This patch (and Erlang 20.3.8, ERTS 9.3.3.1) does not fix the problem. Same issue has been observed (2 servers out of 384 stopped responding to aux work). I believe it's a different issue.
   
    On 6/28/18, 18:14, "[hidden email]" <[hidden email]> wrote:
   
        It may or may not apply but we had a similar problem with system level
        work being scheduled on freed processes, and have been debugging it with
        Ericsson via
       
        https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.erlang.org_browse_ERL-2D573&d=DwIDaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=0rT1IQzTc-5vHQf8ht6-x-ib_QvAEDJvef2Q98CuKgI&m=To7mEUnu73Ram_9CzpEXWHPvoyDBDz5HDUpb2LyN8p0&s=-TFczFM6WlUROTl7JqcN8u6DmWN3djIS9xhIBlNWgJc&e=
       
        for the last few months.  There's a branch of 20 in one of the later
        comments which might help, and might be worth a try.
       
        HTH,
       
        -Anthony
       
       
        > I’m trying to debug some weird condition when any misc system task hangs.
        > It seems to affect OTP 20 (but not 16) on FreeBSD 10.3 and 11.
        >
        > It is a rare problem happening after 5-7 days under some load (~40% cpu
        > average on a 48 cores server).
        >
        > There is also a problem with erlang:statistics(runtime), affected by this
        > bug in FreeBSD kernel:
        > https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.freebsd.org_bugzilla_show-5Fbug.cgi-3Fid-3D227689&d=DwIDaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=0rT1IQzTc-5vHQf8ht6-x-ib_QvAEDJvef2Q98CuKgI&m=To7mEUnu73Ram_9CzpEXWHPvoyDBDz5HDUpb2LyN8p0&s=9ADTx_DQfyQVR8StEXx8AlI49seSh0G6lW-yqem4sIg&e= (so
        > statistics:runtime() always returns the same value), however I doubt it
        > can affect anything.
        >
        > What happens:  there are several calls, e.g.
        > erlang:statistics(garbage_collection), ets:all(),
        > erts_internal:system_check() and few more. All of them do
        > erts_schedule_misc_aux_work. A misc aux work item is put into every
        > scheduler queue, and it seems that all of them except one respond. VM is
        > still working, all other processes are fine, but the one that did the call
        > is waiting in erlang:gc_info/2 (or another corresponding function), with
        > counter equals to 1. Since there is no timeout in receive statement, it
        > waits forever.
        >
        > How do I debug this? Is there any way to find a scheduler that misbehaves?
        > It is one of the normal schedulers. I’m using gdb to attach to BEAM VM.
        >
        > Unfortunately, I cannot run debug VM (it is not able to handle the load).
        > _______________________________________________
        > erlang-questions mailing list
        > [hidden email]
        > https://urldefense.proofpoint.com/v2/url?u=http-3A__erlang.org_mailman_listinfo_erlang-2Dquestions&d=DwIDaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=0rT1IQzTc-5vHQf8ht6-x-ib_QvAEDJvef2Q98CuKgI&m=To7mEUnu73Ram_9CzpEXWHPvoyDBDz5HDUpb2LyN8p0&s=v-L1XxeW7SHFxLJvBtf9oVk7FY7XgrzBtT7WvHaYEmw&e=
        >
       
       
       
   
    _______________________________________________
    erlang-questions mailing list
    [hidden email]
    https://urldefense.proofpoint.com/v2/url?u=http-3A__erlang.org_mailman_listinfo_erlang-2Dquestions&d=DwIGaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=0rT1IQzTc-5vHQf8ht6-x-ib_QvAEDJvef2Q98CuKgI&m=TkZaL2U9Ds0ST7HzmI58OQK0wYrvAvxL1N1SHeqJeYM&s=tv2xgDzpD8TktqBUY7ayug-gYZ2k4ZTmkRnixU24lnY&e=
   

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions