Hello,

I have run into a serious and very annoying bug.

Affects (at least): R13B04, R14A, R14B, R14B01
Platform: Ubuntu Linux 10.10, kernel 2.6.35-25-server (SMP)

When a newly started distributed node receives a high number of messages from another node, the newly started node crashes silently. Nothing is printed to the console. No crash dump or core dump is produced.

In trying to find a work-around, I found the following curious behavior:

* The bug *only* occurs for distributed nodes (but regardless of whether the nodes run on the same machine).
* Waiting a few seconds (or even longer) before sending the first message to the newly started node does *not* make a difference. The node will still crash when confronted with a large number of incoming messages later.
* Speed matters. With a debug build, the bug appears less often than with a release build, especially one with HiPE enabled. However, I managed to cause the bug even in debug mode, and when OTP was not compiled with native libs. The bug is simply much less likely to be observed.
* The number of messages sent *initially* matters most. Slowly "ramping up" the load is a work-around. Once a node is working at high throughput, it is OK to stop sending messages for an arbitrary period and at a later point send a big chunk of messages that would have killed the node if sent initially.
* Timing matters. Running the receiver node with +T 7 or higher makes the problem disappear.
* Setting the sender node's distribution buffer size to the minimum (+zdbbl 1) makes the problem appear less often.

I have reproduced the bug in various applications. The behavior described above also makes it fairly obvious that the application is not at fault.

Rather, it appears that the receiver node is unable to buffer incoming messages and crashes. Of particular interest here is the fact that "ramping up" the load is a work-around. I suspect a low-level race condition where the receiver node does not allocate sufficient buffer space in time and crashes.

Given that the existing work-arounds are not desirable ("ramp up" requires changes to the application code, +T 7 and +zdbbl 1 decrease performance), and given that the bug now persists over multiple releases, I hope someone can soon look into it.

Thank you,

Philipp
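The "ramp up" work-around mentioned in the report could be sketched as a schedule of growing batch sizes. This is only an illustration under assumptions: the name `ramp_up_batches`, its parameters, and the doubling policy are hypothetical and not taken from the report, which does not say how quickly the load was increased. Python is used only to show the arithmetic; the actual sending would happen in the Erlang application.

```python
from typing import Iterator


def ramp_up_batches(total: int, start: int = 1, factor: int = 2,
                    cap: int = 1024) -> Iterator[int]:
    """Yield batch sizes that grow geometrically until `total` is covered.

    All parameters are hypothetical tuning knobs, not values from the report:
      total  - number of messages to send overall
      start  - size of the first (smallest) batch
      factor - growth factor applied after each batch
      cap    - largest batch size allowed once the receiver is warmed up
    """
    if total < 0 or start < 1 or factor < 2 or cap < start:
        raise ValueError("invalid ramp-up parameters")
    size = start
    remaining = total
    while remaining > 0:
        batch = min(size, remaining)      # never send more than is left
        yield batch
        remaining -= batch
        size = min(size * factor, cap)    # grow, but stay under the cap
```

For example, `list(ramp_up_batches(20, start=1, factor=2, cap=8))` gives `[1, 2, 4, 8, 5]`: the sender starts with a single message and only reaches full batch size after a few rounds.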
The bug persists in r14b02.

If I find time, I will make a small demo application so that others can reproduce the bug.

Philipp

On 02/23/2011 04:14 PM, Philipp Unterbrunner wrote:
_______________________________________________
erlang-bugs mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-bugs
Hi!

This sounds really bad! A demo application that reproduces the bug would be really nice.

Have you tried to enable core dumps to see if the erlang node crashes with a segfault? I suppose there are no erl_crash.dump files left after the crash that I can look at either?

Any way to reproduce it would make it easier to find!

Cheers,
/Patrik

On Mon, 28 Mar 2011, Philipp Unterbrunner wrote:
> The bug persists in r14b02.
>
> If I find time, I will make a small demo application so that others can
> reproduce the bug.
>
> Philipp
I do not have a reasonably small demo yet, but I managed to get some coredumps of beam.smp.

The nodes crash with a segfault at hipe_mode_switch.c, line 244 (of R14B02). This is code that is responsible for calling a native code closure.

My application code does indeed send a few closures via messages that are later called by the receiver node. I do not use hot code upgrades, however, and the crashes are timing-related, as described before. I therefore suspect the crashes are the result of a race condition involving whatever code is responsible for making a received fun callable.

Philipp

On 03/29/2011 03:26 PM, [hidden email] wrote:
> Have you tried to enable core dumps to see if the erlang node crashes
> with a segfault? I suppose there are no erl_crash.dump files left
> after the crash that I can look at either?
