Distributed node crashes silently when initially receiving a big chunk of messages from another node

Philipp Unterbrunner
Hello,

I have run into a serious and very annoying bug.

Affects (at least): R13B04, R14A, R14B, R14B01
Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)

When a newly started distributed node receives a high number of messages from another node, the newly started node crashes silently. Nothing is printed to the console. No crash dump or core dump is produced.

In trying to find a work-around, I found the following curious behavior:

* The bug *only* occurs for distributed nodes (but regardless of whether the nodes run on the same machine).
* Waiting a few seconds (or even longer) before sending the first message to the newly started node does *not* make a difference. The node will still crash when confronted with a large number of incoming messages later.
* Speed matters. With a debug build, the bug appears less often than with a release build, especially when HiPE is enabled. However, I managed to trigger the bug even in debug mode, and when OTP was not compiled with native libs. In those configurations the bug is simply much less likely to be observed.
* The number of messages sent *initially* matters most. Slowly "ramping up" the load is a work-around. Once a node is working at high throughput, it is OK to stop sending messages for an arbitrary period and at a later point send a big chunk of messages that would have killed the node if sent initially.
* Timing matters. Running the receiver node with +T 7 or higher makes the problem disappear.
* Setting the sender node's distribution buffer size to the minimum (+zdbbl 1) makes the problem appear less often.
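The "ramping up" work-around from the list above could look roughly like the following on the sender side. This is only a minimal sketch, not the code from the affected applications: the module name ramp_sender, the doubling schedule, and the pause between batches are all assumptions chosen for illustration.

```erlang
-module(ramp_sender).
-export([send_ramped/3, batch_sizes/2]).

%% Hypothetical helper: batch sizes that double, starting at First,
%% until Total messages are covered.
%% batch_sizes(10, 1) returns [1,2,4,3].
batch_sizes(Total, First) when Total >= 0, First > 0 ->
    batch_sizes(Total, First, []).

batch_sizes(0, _Size, Acc) ->
    lists:reverse(Acc);
batch_sizes(Left, Size, Acc) ->
    N = min(Left, Size),
    batch_sizes(Left - N, Size * 2, [N | Acc]).

%% Send Msgs to Dest in growing batches, sleeping PauseMs milliseconds
%% after each batch. Dest may be a pid or a {Name, Node} tuple that
%% addresses a registered process on another node.
send_ramped(Dest, Msgs, PauseMs) ->
    Sizes = batch_sizes(length(Msgs), 1),
    send_batches(Dest, Msgs, Sizes, PauseMs).

send_batches(_Dest, [], [], _PauseMs) ->
    ok;
send_batches(Dest, Msgs, [N | Sizes], PauseMs) ->
    {Batch, Rest} = lists:split(N, Msgs),
    lists:foreach(fun(M) -> Dest ! M end, Batch),
    timer:sleep(PauseMs),
    send_batches(Dest, Rest, Sizes, PauseMs).
```

The two flag-based work-arounds need no code changes: they amount to starting the receiver with something like "erl -sname receiver +T 7" and the sender with "erl -sname sender +zdbbl 1" (the node names here are placeholders).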

I have reproduced the bug in various applications. The behavior described above also makes it fairly obvious that the application is not at fault.

Rather, it appears that the receiver node is unable to buffer incoming messages and crashes. Of particular interest here is the fact that "ramping up" the load is a work-around. I suspect a low-level race condition where the receiver node does not allocate sufficient buffer space in time and crashes.

Given that the existing work-arounds are not desirable ("ramp up" requires changes to the application code, +T 7 and +zdbbl 1 decrease performance), and given that the bug now persists over multiple releases, I hope someone can soon look into it.

Thank you,

Philipp

[erlang-bugs 10] Re: [erlang-bugs] Distributed node crashes silently when initially receiving a big chunk of messages from another node

Philipp Unterbrunner
The bug persists in R14B02.

If I find time, I will make a small demo application so that others can reproduce the bug.

Philipp

_______________________________________________
erlang-bugs mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-bugs


Re: [erlang-bugs 10] Re: Distributed node crashes silently when initially receiving a big chunk of messages from another node

pan
Hi!

This sounds really bad! A demo application that reproduces the bug would
be really nice.

Have you tried to enable core dumps to see if the erlang node crashes with
a segfault? I suppose there are no erl_crash.dump files left after the
crash that I can look at either?

Any way to reproduce it would make it easier to find!

Cheers,
/Patrik


Re: [erlang-bugs 10] Re: Distributed node crashes silently when initially receiving a big chunk of messages from another node

Philipp Unterbrunner
I do not have a reasonably small demo yet, but I managed to get some
core dumps of beam.smp. The nodes crash with a segfault at
hipe_mode_switch.c, line 244 (of R14B02). This is the code that is
responsible for calling a native code closure.

My application code does indeed send a few closures via messages, which
are later called by the receiver node. I do not use hot code upgrades,
however, and the crashes are timing-related, as described before. I
therefore suspect the crashes are the result of a race condition
involving whatever code is responsible for making a received fun callable.
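
For readers unfamiliar with the pattern, sending a closure in a message and calling it on the receiving side can be sketched as below. This is an illustration only, not the application code and not a confirmed reproducer of the crash: the module name fun_receiver, the registered name, and the message shapes {call, From, F} and {result, Value} are assumptions.

```erlang
-module(fun_receiver).
-export([start/0, loop/0]).

%% Spawn the receiver and register it under the assumed name
%% fun_receiver, so a remote node can address it as
%% {fun_receiver, Node}.
start() ->
    Pid = spawn(?MODULE, loop, []),
    register(fun_receiver, Pid),
    Pid.

%% Call every received zero-arity fun and send the result back to the
%% requesting process. Any other message is left in the mailbox.
loop() ->
    receive
        {call, From, F} when is_function(F, 0) ->
            From ! {result, F()},
            loop();
        stop ->
            ok
    end.
```

A sender on another node would then do something like {fun_receiver, 'receiver@host'} ! {call, self(), fun() -> 42 end}, where 'receiver@host' is a placeholder node name. Note that a fun defined in a module can only be called on the receiver if the same version of that module is loaded there.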

Philipp

