Unbalanced Scheduler Problem caused by Using out of date active_no_runq


songlu cai

Hi,

 

We have run into an unbalanced-scheduler problem with Erlang on our proxy servers (which process and forward requests from clients).

 

Env:

Erlang Version: R16B03/SMP 24 schedulers online/-swt low

Mem: 64G, 1.6G occupied by beam.smp

CPU: 24 Xeon cores

 

Issue Description:

The clients establish a number of TCP keep-alive connections to the proxy servers; from time to time they also open short-lived (non keep-alive) TCP connections. The clients then send requests to the proxy servers, which process and forward them to the backend servers.

During the test we find that 11 schedulers run at 100% usage while the other 13 sit idle (0%), and the busy ones have long run queues (rq->len around 100). Sometimes it is 21 busy and 3 idle instead. This state lasts anywhere from 30 minutes to 6 hours in an unpredictable way; sometimes it disappears and then, after an hour or a day, occasionally comes back again.

 

Debug:

So I wrote a pair of gdb scripts to capture the runq flags and other struct members, to see what happens when the problem shows up. collect.sh captures the state of every runq once every 2 seconds; wakeix.sh captures which runq (ix) gets woken up while the schedulers are in the unbalanced state. (The struct offsets in the .gdbinit helpers below are hard-coded for this particular R16B03 beam.smp build.)

//////////////////////////////////////////This is a separator/////////////////////////////////////////

#cat collect.sh
#!/bin/sh

#9075 is process id of beam.smp
while [ 1 ];
do
date
gdb -p 9075 <<EOF
print_wakeup_reds
detach
quit
EOF
sleep 2
done

//////////////////////////////////////////This is a separator/////////////////////////////////////////

#cat wakeix.sh
#!/bin/sh

gdb -p 9075 <<EOF
b wake_scheduler
c
print_rdi
detach
quit
EOF

//////////////////////////////////////////This is a separator/////////////////////////////////////////

#cat .gdbinit

define print_rdi

        set $a = *(int*)$rdi

        printf "wake ix:%d\n", $a

end

 

define print_wakeup_reds

        set $balance_info = (long)&balance_info

        set $a = $balance_info + 664

        set $no_runqs = *(int*)$a

        set $addr = (long)&erts_aligned_run_queues

        set $start = *(long*)$addr

        set $i = 0

        while($i<24)

                set $a1 = $i*1024+$start+0x318

                set $a2 = $i*1024+$start+0x314

                set $a3 = $i*1024+$start+0x310

                set $a4 = $i*1024+$start+0x2e0

                set $a5 = $i*1024+$start+0x2dc

                set $a6 = $i*1024+$start+0x2d8

                set $a7 = $i*1024+$start

                set $wakeup_other_reds = *(int*)$a1

                set $wakeup_other = *(int*)$a2

                set $len = *(int*)$a3

                set $flags = *(int*)$a4

                set $woken = *(int*)$a5

                set $waiting = *(int*)$a6

                set $ix = *(int*)$a7

                printf "ix:%d len:%d wakeup_other:%d wakeup_other_reds:%d woken:%d waiting:%d flags:%x ", $ix, $len, $wakeup_other, $wakeup_other_reds, $woken, $waiting, $flags

                parse_flags $flags

                printf "\n"

                set $i = $i+1

        end

        printf "no_runqs:%x run_q:%d empty_q:%d\n", $no_runqs, erts_no_run_queues, no_empty_run_queues

        detach

        quit

end

 

define parse_flags

        set $f = $arg0

        if (($f&0x100000) != 0)

                printf "OUT_OF_WORK "

        end

        if (($f&0x200000) != 0)

                printf "HALFTIME_OUT_OF_WORK "

        end

        if (($f&0x400000) != 0)

                printf "SUSPENDED "

        end

        if (($f&0x800000) != 0)

                printf "CHK_CPU_BIND "

        end

        if (($f&0x1000000) != 0)

                printf "INACTIVE "

        end

        if (($f&0x2000000) != 0)

                printf "NONEMPTY "

        end

        if (($f&0x4000000) != 0)

                printf "PROTECTED"

        end

end

//////////////////////////////////////////This is a separator/////////////////////////////////////////

I captured about 100M+ of data over 3 days. The capture is not atomic as a whole, but a read of any single struct member (flags, for example) can be treated as atomic. Some weird data shows up together with the unbalanced state. For example:

//////////////////////////////////////////This is a separator/////////////////////////////////////////

ix:0 len:13 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:1 len:5 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:2 len:4 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:3 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:4 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:5 len:3 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:6 len:1 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300004 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:7 len:1 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:8 len:0 wakeup_other:797 wakeup_other_reds:0 woken:0 waiting:0 flags:2300000 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:9 len:4 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300010 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:10 len:12 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:11 len:9 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:12 len:17 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:13 len:14 wakeup_other:720 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:14 len:14 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:15 len:3 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:16 len:7 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:17 len:11 wakeup_other:0 wakeup_other_reds:520 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:18 len:14 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:0 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:19 len:10 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:20 len:10 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:2300014 OUT_OF_WORK HALFTIME_OUT_OF_WORK NONEMPTY

ix:21 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:3100000 OUT_OF_WORK INACTIVE NONEMPTY

ix:22 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:3100000 OUT_OF_WORK INACTIVE NONEMPTY

ix:23 len:0 wakeup_other:0 wakeup_other_reds:0 woken:0 waiting:1 flags:1100000 OUT_OF_WORK INACTIVE

no_runqs:180015 run_q:24 empty_q:1

//////////////////////////////////////////This is a separator/////////////////////////////////////////

 

Analysis:

no_runqs is printed in hex: 0x180015 means 24 online/balance run queues (0x18 in the upper half) and 21 active run queues (0x15 in the lower half). Run queue ix=21 (the 22nd) carries the flags INACTIVE|NONEMPTY (the ERTS_RUNQ_FLG_ prefix is omitted for brevity). Reading the R16B03 wakeup code, I found that something goes wrong in chk_wake_sched when the run queue to be woken has INACTIVE|NONEMPTY set.
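
As a cross-check of this decoding, here is a small standalone C sketch (not ERTS code; it assumes, as the captured value suggests, that no_runqs packs the online/balance count in the upper 16 bits and the active count in the lower 16 bits, and it reuses the flag masks tested in parse_flags above):

//////////////////////////////////////////This is a separator/////////////////////////////////////////

#include <stdio.h>

/* Decode the captured words. The 16-bit packing of no_runqs is an assumption
 * based on the observed value; the flag masks are the ones used in parse_flags. */
int main(void)
{
    unsigned no_runqs = 0x180015;     /* captured balance_info value        */
    unsigned flags    = 0x3100000;    /* captured flags of run queue ix=21  */

    printf("online/balance runqs: %u\n", (no_runqs >> 16) & 0xffff);  /* 24 */
    printf("active runqs:         %u\n", no_runqs & 0xffff);          /* 21 */

    if (flags & 0x100000)  printf("OUT_OF_WORK ");
    if (flags & 0x1000000) printf("INACTIVE ");
    if (flags & 0x2000000) printf("NONEMPTY ");
    printf("\n");   /* prints: OUT_OF_WORK INACTIVE NONEMPTY */
    return 0;
}

//////////////////////////////////////////This is a separator/////////////////////////////////////////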

//////////////////////////////////////////This is a separator/////////////////////////////////////////

static void

wake_scheduler_on_empty_runq(ErtsRunQueue *crq)

{

    int ix = crq->ix;

    int stop_ix = ix;

    int active_ix, balance_ix;

 

    get_no_runqs(&active_ix, &balance_ix);

 

    if (active_ix > balance_ix)

         active_ix = balance_ix;

 

    if (ix >= active_ix)

         stop_ix = ix = active_ix;

 

    /* Try to wake a scheduler on an active run queue */

    while (1) {

         A:ix--;

         if (ix < 0) {

             if (active_ix == stop_ix)

         B:     break;

         C:    ix = active_ix - 1;

         }

         if (ix == stop_ix)

         D:    break;

         E: if (chk_wake_sched(crq, ix, 0))

         F:    return;

    }

 

    if (active_ix < balance_ix) {

         /* Try to activate a new run queue and wake its scheduler */

         G: (void) chk_wake_sched(crq, active_ix, 1);

    }

}

static ERTS_INLINE int

chk_wake_sched(ErtsRunQueue *crq, int ix, int activate)

{

    Uint32 flags;

    ErtsRunQueue *wrq;

    if (crq->ix == ix)

         return 0;

    wrq = ERTS_RUNQ_IX(ix);

    flags = ERTS_RUNQ_FLGS_GET(wrq);

    H: if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY))) {

         if (activate) {

             if (try_inc_no_active_runqs(ix+1))

                   (void) ERTS_RUNQ_FLGS_UNSET(wrq, ERTS_RUNQ_FLG_INACTIVE);

         }

I:       wake_scheduler(wrq, 0);

         return 1;

    }

    J: return 0;

}

//////////////////////////////////////////This is a separator/////////////////////////////////////////


Root cause:

A possible execution path:

Step 1: the scheduler owning runq ix=10 calls wakeup_other_check, which falls through to wake_scheduler_on_empty_runq; active_ix=21, stop_ix=10

Step 2: A->E->H->J loops 10 times (ix=9 down to 0)

Step 3: A (ix=-1) -> C (ix=20) -> E -> H -> J

Step 4: A->E->H->J loops 9 times (ix=19 down to 11)

Step 5: A (ix=10) -> D -> G (active_ix=21) -> H (ix=21) -> J

 

The problem happens in step 5 at H->J: runq ix=21 (the 22nd) has NONEMPTY|INACTIVE set, so the test at "H: if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY)))" fails and execution falls through to "J: return 0;". As a result, the 22nd runq (ix=21) will never be woken up, unless some later check_balance call clears the NONEMPTY flag.

But not every check_balance call will clear the NONEMPTY flag; that depends on the history and on the current working set. In short, it is hard to know when we will get out of the unbalanced state. In our stress test it lasted anywhere from 30 minutes to several hours, depending on the test case.

The NONEMPTY|INACTIVE combination is also dangerous: during the test it once appeared on runq ix=10 (the 11th), so only 11 of the 24 CPUs could be used while the rest sat idle, and the clients were jammed with long latencies.

 

Where does the NONEMPTY|INACTIVE flag combination come from?

Two functions are involved:

static ERTS_INLINE void get_no_runqs(int *active, int *used)

static ERTS_INLINE void set_no_active_runqs(int active)

 

A possible path:

Time T1: Thread A does the wakeup check and calls get_no_runqs, which returns active_ix=24 (all 24 runqs active), so it decides to wake the scheduler of runq ix=21 (the 22nd); call that scheduler Thread W

Time T2: Thread B runs schedule(), which calls check_balance; that sets the number of active runqs to 21 via set_no_active_runqs(21), and runq ix=21 (the 22nd) gets the INACTIVE flag

Time T3: Thread A calls wake_scheduler; Thread W wakes up from its kernel futex wait (in scheduler_wait) and calls non_empty_runq(runq ix=21), so the flags become INACTIVE|NONEMPTY

Time T4: Thread W cannot steal processes from other runqs because of the INACTIVE flag, so it goes back to sleep on the futex with flags = INACTIVE|NONEMPTY

In the end, runq ix=21 is left with INACTIVE|NONEMPTY and is never woken up until some later, lucky check_balance call happens to get it out of that state.

The essence of the problem is that we are using a value (active_ix) that is out of date (it has already been updated by another thread).
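
To make the stale read concrete, here is a minimal sequential C sketch of the T1-T4 interleaving above (all names and flag values are illustrative, not the real ERTS structures; in the VM these steps run on three different threads):

//////////////////////////////////////////This is a separator/////////////////////////////////////////

#include <stdio.h>

/* Sequential sketch of the T1-T4 interleaving. Names and flag values are
 * illustrative only; in the real VM these steps run on different threads. */
#define FLG_INACTIVE 0x1
#define FLG_NONEMPTY 0x2

int main(void)
{
    int no_active_runqs = 24;   /* balance_info: number of active run queues */
    int rq21_flags      = 0;    /* run queue ix=21: empty, scheduler asleep  */

    /* T1: waker thread A snapshots the active count and picks ix=21 to wake. */
    int cached_active = no_active_runqs;              /* stale copy: 24       */

    /* T2: balancer thread B shrinks the active set; ix=21 becomes INACTIVE.  */
    no_active_runqs = 21;
    rq21_flags |= FLG_INACTIVE;

    /* T3: the woken scheduler W marks its own queue NONEMPTY ...             */
    rq21_flags |= FLG_NONEMPTY;

    /* T4: ... but, being INACTIVE, it may not steal work and sleeps again.   */

    /* Later wake attempts behave like chk_wake_sched: a NONEMPTY queue is skipped. */
    if (rq21_flags & FLG_NONEMPTY)
        printf("ix=21 stuck with INACTIVE|NONEMPTY (waker saw %d active, real %d)\n",
               cached_active, no_active_runqs);
    return 0;
}

//////////////////////////////////////////This is a separator/////////////////////////////////////////

The fix options below are about closing exactly this window.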

 

How to fix:

1. Use a mutex around everything that touches balance_info.no_runqs; but the critical region would be very long, a lot of code would have to be modified, and performance might degrade.

2. Check for the INACTIVE|NONEMPTY combination in chk_wake_sched; this means that sometimes we wake up (activate) a runq that is already NONEMPTY.

3. Or another way?


Best Regards,

Zijia



Re: Unbalanced Scheduler Problem caused by Using out of date active_no_runq

Scott Lystig Fritchie
songlu cai <[hidden email]> wrote:

slc> How to fix:

slc> [...]

slc> 3, Or Another Way?

Wow, that's quite a diagnosis.  I'm not a good judge of the race
condition that you've found or your fix.  I can provide some context,
however, in case that you weren't aware of it.  It might help to create
a Real, Final, 100% Correct Fix ... something which does not exist right
now.

The best workaround is to use "+scl false" and "+sfwi" with a value of
500 or a bit smaller.  See the discussion last month about it,

    http://erlang.org/pipermail/erlang-questions/2014-September/081017.html

My colleague Joe Blomstedt wrote a demo program that can cause scheduler
collapse to happen pretty quickly.  It might be useful for judging how
well any fix works ... at Basho we had a terrible time trying to
reproduce this bug before Joe found a semi-reliable trigger.

    https://github.com/basho/nifwait

It is discussed in this email thread (which is broken across two URLs,
sorry).

    http://erlang.org/pipermail/erlang-questions/2012-October/069503.html
        (first message only, I don't know why)
    http://erlang.org/pipermail/erlang-questions/2012-October/069585.html
        (the rest of the thread)

If your analysis is correct ... then hopefully this can lead quickly to
a Real, Final, 100% Correct Fix.  I'm tired of diagnosing systems that
suffer scheduler collapse and then discover that the customer forgot to
add the magic +sfwi and +scl flags in their runtime configuration to
work around that !@#$! bug.

-Scott

Re: Unbalanced Scheduler Problem caused by Using out of date active_no_runq

songlu cai

Hi Scott,

 

Thanks for your attention & quick reply.

It seems that quite a few people suffer from this problem.

 

Scott>The best workaround is to use "+scl false" and "+sfwi" with a value of 500 or a bit smaller

1. We set +sfwi 500.

2. At first we set +scl false, but it caused unbalanced runq lengths across the runqs on R16B03, so we went back to +scl true (the default); +scl false is not a safe choice on R16B03.

 

Our test cmdline:

/home/Xxx/erts-5.10.4/bin/beam.smp -zdbbl 8192 -sbt db -sbwt very_short -swt low -sfwi 500 -MBmmsbc 100 -MHmmsbc 100 -MBmmmbc 100 -MHmmmbc 100 -MMscs 20480 -MBsmbcs 10240 -MHsbct 2048 -W w -e 50000 -Q 1000000 -hmbs 46422 -hms 2586 -P 1000000 -A 16 -K true -d -Bi -sct L23T0C0P0N0:L22T1C1P0N0:L21T2C2P0N0:L20T3C3P0N0:L19T4C4P0N0:L18T5C5P0N0:L17T6C0P1N1:L16T7C1P1N1:L15T8C2P1N1:L14T9C3P1N1:L13T10C4P1N1:L12T11C5P1N1:L11T12C0P0N0:L10T13C1P0N0:L9T14C2P0N0:L8T15C3P0N0:L7T16C4P0N0:L6T17C5P0N0:L5T18C0P1N1:L4T19C1P1N1:L3T20C2P1N1:L2T21C3P1N1:L1T22C4P1N1:L0T23C5P1N1 -swct medium -- -root /home/Xxx/dir -progname Xxx -- -home /root -- -boot /home/Xxx/dir -mode interactive -config /home/Xxx/sys.config -shutdown_time 30000 -heart -setcookie Xxx -name proxy@Xxx -- console

 

And, apart from everything else, INACTIVE|NONEMPTY is not a normal state for the runq flags.

Over the next few days I will apply my fix for this not-yet-confirmed bug on top of R16B03 and run the test cases again.


Best Regards,

Songlu Cai



Re: Unbalanced Scheduler Problem caused by Using out of date active_no_runq

songlu cai
Hi Scott,

Last week I fixed the bug in a simple way, then ran the fixed version side by side with the old, unbalanced version.
The two nodes were under the same load on the same timeline.
The unbalanced (collapsed) state came up several times on the old version but never showed up on the fixed version.
On the fixed version the load spreads evenly across all 24 schedulers (especially under high load).
In fact, the fixed version carries more load while the old version is stuck in the unbalanced state:
the old version is effectively down to 4 schedulers and quickly hits its bottleneck at about 400% CPU, while at the same time the fixed version runs at about 1200% CPU.
So I am confident that the root cause of the unbalanced schedulers (scheduler collapse) is "using out of date active_no_runq", just as analyzed before.

I only modified the chk_wake_sched function; the code diff is below:

Index: emulator/beam/erl_process.c
===================================================================
--- emulator/beam/erl_process.c (revision 298281)
+++ emulator/beam/erl_process.c (working copy)
@@ -2694,6 +2694,16 @@
        return 0;
     wrq = ERTS_RUNQ_IX(ix);
     flags = ERTS_RUNQ_FLGS_GET(wrq);
+
+    if ( activate &&
+       (flags & ERTS_RUNQ_FLG_NONEMPTY)  && 
+       (flags & ERTS_RUNQ_FLG_INACTIVE)) {
+       if (try_inc_no_active_runqs(ix+1))
+               (void) ERTS_RUNQ_FLGS_UNSET(wrq, ERTS_RUNQ_FLG_INACTIVE);
+       wake_scheduler(wrq, 0);
+       return 1;
+    }
+
     if (!(flags & (ERTS_RUNQ_FLG_SUSPENDED|ERTS_RUNQ_FLG_NONEMPTY))) {
        if (activate) {
            if (try_inc_no_active_runqs(ix+1))

It rescues the scheduler from the weird state. It is not a perfect fix, but it is an effective one.
Scott, would you please apply this patch to R16B03 and run your test case again?
Thank you very much; I look forward to your reply.
I will also keep it running for a week to make sure we really have fixed the problem.

Best Regards,
Songlu Cai


Re: Unbalanced Scheduler Problem caused by Using out of date active_no_runq

Rickard Green
Hi Songlu Cai,

Thanks for your work on this!

Although it is not an error for the NONEMPTY and INACTIVE flags to be set at
the same time, it should occur rather infrequently. Assuming that is the
case, your fix should have little effect on the behavior, so the root
cause should be elsewhere. I've found a potential race where the NONEMPTY
flag could end up set on the run queue of a waiting scheduler, which could
cause the problem you are seeing.

When implementing support for balancing on scheduler utilization for
OTP 17.0 I rewrote the code that set the NONEMPTY flag. This rewrite
removes that potential race. I've back-ported this rewrite based on
R16B03-1. It can be found in the
rickard/R16B03-1/load_balance/OTP-11385 branch
<https://github.com/rickard-green/otp/tree/rickard/R16B03-1/load_balance/OTP-11385>
of my github repo <https://github.com/rickard-green/otp.git>. Please
try it out and see if it solves your problem (use the same
configuration as before).

Something similar to your fix should perhaps be introduced in the end
anyway (one also has to check the SUSPENDED flag, though), since there
is no good reason to prevent the activation just because the run queue
happens to be non-empty. It would, however, be good to see the root
issue fixed first before introducing this.

Regards,
Rickard Green, Erlang/OTP, Ericsson AB


Re: Unbalanced Scheduler Problem caused by Using out of date active_no_runq

songlu cai
Hi Rickard,

Thank you for your help & quick reply!

Following your suggestion, I ran another one-week test with four versions of Erlang:
V1: R16B03.origin, with the collapse problem
V2: R16B03.fix, with my own fix patch
V3: R16B03.backport, from the github branch you provided
V4: R17B03, from erlang.org

During the one-week test, V1 collapsed twice, and the collapsed state lasted for nearly two days.
V2, V3 and V4 all behaved the same: balanced, as normal.
So I will use R16B03.backport to fix my problem.
Thanks again.

Best Regards,
Songlu Cai
