enif_send() and overrun heap

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

enif_send() and overrun heap

Guilherme Andrade
Hello list,

I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling enif_send() a few dozen times a second from a dirty scheduler (CPU bound), with msg_env=NULL, results in heap overrun.

These is the flow that sooner or later results in heap overrun:

1) Single Erlang process makes a specific NIF call ~25 times per second (dirty scheduler, CPU bound)
2) This call will receive a few network packets (non-blocking)
3) Each of these packets gets wrapped in a tuple (allocated in process env)
4) For each of these wrapped packets, a lookup is made in a map, passed as a NIF argument, for a process dedicated to handling this particular packet
5.a) If the lookup succeeds, enif_send() is called to dispatch the wrapped packet to said dedicated process (with msg_env=NULL) - this is what happens to most packets
5.b) If the lookup fails, the wrapped packet is accumulated and later returned to the NIF caller

Now, when total packets per second increase to a few dozen, sooner or later (sometimes as soon as after ~10 seconds) the VM stops abruptly with this error message:

> hend=0x0000000013655fa0
> stop=0x0000000013655ef8
> htop=0x00000000136562c8
> heap=0x0000000013652db0
> beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap

(The pid mentioned above corresponds to the NIF caller.)

I tried three things (independently) that prevent the overrun from happening under this load:
A) Increasing the NIF caller heap size from the default (233 words) to 23300 words
B) Not running the NIF under a dirty scheduler
C) Not calling enif_send

Any ideas on why the overrun happens? Am I missing some very obvious transgression in the way enif_send() or dirty schedulers are supposed to be used?


--
Guilherme

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Daniel Goertzen-3
I wonder if running under a dirty scheduler counts as "running from a created thread" in which case the first parameter to enif_send() should be NULL.

On Wed, 20 Jun 2018 at 08:30 Guilherme Andrade <[hidden email]> wrote:
Hello list,

I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling enif_send() a few dozen times a second from a dirty scheduler (CPU bound), with msg_env=NULL, results in heap overrun.

These is the flow that sooner or later results in heap overrun:

1) Single Erlang process makes a specific NIF call ~25 times per second (dirty scheduler, CPU bound)
2) This call will receive a few network packets (non-blocking)
3) Each of these packets gets wrapped in a tuple (allocated in process env)
4) For each of these wrapped packets, a lookup is made in a map, passed as a NIF argument, for a process dedicated to handling this particular packet
5.a) If the lookup succeeds, enif_send() is called to dispatch the wrapped packet to said dedicated process (with msg_env=NULL) - this is what happens to most packets
5.b) If the lookup fails, the wrapped packet is accumulated and later returned to the NIF caller

Now, when total packets per second increase to a few dozen, sooner or later (sometimes as soon as after ~10 seconds) the VM stops abruptly with this error message:

> hend=0x0000000013655fa0
> stop=0x0000000013655ef8
> htop=0x00000000136562c8
> heap=0x0000000013652db0
> beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap

(The pid mentioned above corresponds to the NIF caller.)

I tried three things (independently) that prevent the overrun from happening under this load:
A) Increasing the NIF caller heap size from the default (233 words) to 23300 words
B) Not running the NIF under a dirty scheduler
C) Not calling enif_send

Any ideas on why the overrun happens? Am I missing some very obvious transgression in the way enif_send() or dirty schedulers are supposed to be used?


--
Guilherme
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Guilherme Andrade

On Wed, 20 Jun 2018, 14:42 Daniel Goertzen, <[hidden email]> wrote:
I wonder if running under a dirty scheduler counts as "running from a created thread" in which case the first parameter to enif_send() should be NULL.

Good point! That seemed to do it, thanks.

May someone from OTP  team confirm dirty schedulers count as created threads for enif_send? (It makes perfect sense they would, but it's best to check.)

If confirmed, maybe I'll do a PR to make documentation more explicit about this corner case.



On Wed, 20 Jun 2018 at 08:30 Guilherme Andrade <[hidden email]> wrote:
Hello list,

I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling enif_send() a few dozen times a second from a dirty scheduler (CPU bound), with msg_env=NULL, results in heap overrun.

These is the flow that sooner or later results in heap overrun:

1) Single Erlang process makes a specific NIF call ~25 times per second (dirty scheduler, CPU bound)
2) This call will receive a few network packets (non-blocking)
3) Each of these packets gets wrapped in a tuple (allocated in process env)
4) For each of these wrapped packets, a lookup is made in a map, passed as a NIF argument, for a process dedicated to handling this particular packet
5.a) If the lookup succeeds, enif_send() is called to dispatch the wrapped packet to said dedicated process (with msg_env=NULL) - this is what happens to most packets
5.b) If the lookup fails, the wrapped packet is accumulated and later returned to the NIF caller

Now, when total packets per second increase to a few dozen, sooner or later (sometimes as soon as after ~10 seconds) the VM stops abruptly with this error message:

> hend=0x0000000013655fa0
> stop=0x0000000013655ef8
> htop=0x00000000136562c8
> heap=0x0000000013652db0
> beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap

(The pid mentioned above corresponds to the NIF caller.)

I tried three things (independently) that prevent the overrun from happening under this load:
A) Increasing the NIF caller heap size from the default (233 words) to 23300 words
B) Not running the NIF under a dirty scheduler
C) Not calling enif_send

Any ideas on why the overrun happens? Am I missing some very obvious transgression in the way enif_send() or dirty schedulers are supposed to be used?


--
Guilherme
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Daniel Goertzen-3
Hmm, I think there's more to it.  The source shows that dirty schedulers *are* specifically handled in enif_send(), so something else is up.



On Wed, 20 Jun 2018 at 09:18 Guilherme Andrade <[hidden email]> wrote:

On Wed, 20 Jun 2018, 14:42 Daniel Goertzen, <[hidden email]> wrote:
I wonder if running under a dirty scheduler counts as "running from a created thread" in which case the first parameter to enif_send() should be NULL.

Good point! That seemed to do it, thanks.

May someone from OTP  team confirm dirty schedulers count as created threads for enif_send? (It makes perfect sense they would, but it's best to check.)

If confirmed, maybe I'll do a PR to make documentation more explicit about this corner case.



On Wed, 20 Jun 2018 at 08:30 Guilherme Andrade <[hidden email]> wrote:
Hello list,

I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling enif_send() a few dozen times a second from a dirty scheduler (CPU bound), with msg_env=NULL, results in heap overrun.

These is the flow that sooner or later results in heap overrun:

1) Single Erlang process makes a specific NIF call ~25 times per second (dirty scheduler, CPU bound)
2) This call will receive a few network packets (non-blocking)
3) Each of these packets gets wrapped in a tuple (allocated in process env)
4) For each of these wrapped packets, a lookup is made in a map, passed as a NIF argument, for a process dedicated to handling this particular packet
5.a) If the lookup succeeds, enif_send() is called to dispatch the wrapped packet to said dedicated process (with msg_env=NULL) - this is what happens to most packets
5.b) If the lookup fails, the wrapped packet is accumulated and later returned to the NIF caller

Now, when total packets per second increase to a few dozen, sooner or later (sometimes as soon as after ~10 seconds) the VM stops abruptly with this error message:

> hend=0x0000000013655fa0
> stop=0x0000000013655ef8
> htop=0x00000000136562c8
> heap=0x0000000013652db0
> beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap

(The pid mentioned above corresponds to the NIF caller.)

I tried three things (independently) that prevent the overrun from happening under this load:
A) Increasing the NIF caller heap size from the default (233 words) to 23300 words
B) Not running the NIF under a dirty scheduler
C) Not calling enif_send

Any ideas on why the overrun happens? Am I missing some very obvious transgression in the way enif_send() or dirty schedulers are supposed to be used?


--
Guilherme
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Sverker Eriksson-4
Yes, a "created thread" refers to a thread that was not created by the VM
itself.

I suggest you (and everybody during NIF development) to run on a debug-VM.
It's slower but it catches a lot of faults earlier and produces a nicer core
dump. It will catch enif_* API violations such as building tuples/lists/maps
with terms from different environments.

# Build in source tree with
cd $ERL_TOP/erts/emulator
make debug

# Run in source tree with
$ERL_TOP/bin/cerl -debug      # yes, cerl with a 'c'


or if OTP-20 or newer
copy $ERL_TOP/bin/<target>/beam.debug.smp and erl_child_setup.debug
to same install dir as beam.smp and run with

erl -emu_type debug


/Sverker, Erlang/OTP Ericsson


On ons, 2018-06-20 at 10:17 -0500, Daniel Goertzen wrote:

> Hmm, I think there's more to it.  The source shows that dirty schedulers *are*
> specifically handled in enif_send(), so something else is up.
>
>
>
> On Wed, 20 Jun 2018 at 09:18 Guilherme Andrade <[hidden email]> wrote:
> >
> > On Wed, 20 Jun 2018, 14:42 Daniel Goertzen, <[hidden email]>
> > wrote:
> > > I wonder if running under a dirty scheduler counts as "running from a
> > > created thread" in which case the first parameter to enif_send() should be
> > > NULL.
> > >
> > Good point! That seemed to do it, thanks.
> >
> > May someone from OTP  team confirm dirty schedulers count as created threads
> > for enif_send? (It makes perfect sense they would, but it's best to check.)
> >
> > If confirmed, maybe I'll do a PR to make documentation more explicit about
> > this corner case.
> >
> >
> > >
> > > On Wed, 20 Jun 2018 at 08:30 Guilherme Andrade <[hidden email]> wrote:
> > > > Hello list,
> > > >
> > > > I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling
> > > > enif_send() a few dozen times a second from a dirty scheduler (CPU
> > > > bound), with msg_env=NULL, results in heap overrun.
> > > >
> > > > These is the flow that sooner or later results in heap overrun:
> > > >
> > > > 1) Single Erlang process makes a specific NIF call ~25 times per second
> > > > (dirty scheduler, CPU bound)
> > > > 2) This call will receive a few network packets (non-blocking)
> > > > 3) Each of these packets gets wrapped in a tuple (allocated in process
> > > > env)
> > > > 4) For each of these wrapped packets, a lookup is made in a map, passed
> > > > as a NIF argument, for a process dedicated to handling this particular
> > > > packet
> > > > 5.a) If the lookup succeeds, enif_send() is called to dispatch the
> > > > wrapped packet to said dedicated process (with msg_env=NULL) - this is
> > > > what happens to most packets
> > > > 5.b) If the lookup fails, the wrapped packet is accumulated and later
> > > > returned to the NIF caller
> > > >
> > > > Now, when total packets per second increase to a few dozen, sooner or
> > > > later (sometimes as soon as after ~10 seconds) the VM stops abruptly
> > > > with this error message:
> > > >
> > > > > hend=0x0000000013655fa0
> > > > > stop=0x0000000013655ef8
> > > > > htop=0x00000000136562c8
> > > > > heap=0x0000000013652db0
> > > > > beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap
> > > >
> > > > (The pid mentioned above corresponds to the NIF caller.)
> > > >
> > > > I tried three things (independently) that prevent the overrun from
> > > > happening under this load:
> > > > A) Increasing the NIF caller heap size from the default (233 words) to
> > > > 23300 words
> > > > B) Not running the NIF under a dirty scheduler
> > > > C) Not calling enif_send
> > > >
> > > > Any ideas on why the overrun happens? Am I missing some very obvious
> > > > transgression in the way enif_send() or dirty schedulers are supposed to
> > > > be used?
> > > >
> > > >
> > > > _______________________________________________
> > > > erlang-questions mailing list
> > > > [hidden email]
> > > > http://erlang.org/mailman/listinfo/erlang-questions
> > > >
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Guilherme Andrade
Hi,

On 20 June 2018 at 18:39, Sverker Eriksson <[hidden email]> wrote:
I suggest you (and everybody during NIF development) to run on a debug-VM.
It's slower but it catches a lot of faults earlier and produces a nicer core
dump. It will catch enif_* API violations such as building tuples/lists/maps
with terms from different environments.

I finally found the time to look into this matter again. 

Using a debug build of the VM (20.3), the crash consistently happens due to a failed assertion in line 445 of erl_nif.c[1].
Other conditions are the same as before.

Is this enlightening in anyway? I've got a 2.8 GiB core dump I can look into if you tell me that's the right direction to go.



--
Guilherme

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Guilherme Andrade
I've distilled the flow that triggers the crash and made a demo.

    https://github.com/g-andrade/enif_send_crash

It implements a dirty NIF which sends messages to a second process using enif_send() and a message containing a heap-allocated term.
It increases the number of iterations exponentially until it crashes.

This is what happens in my machine very consistently (macOS):

> ./enif_send_crash
> Trying to crash (1 iterations)...
> Trying to crash (2 iterations)...
> Trying to crash (4 iterations)...
> Trying to crash (8 iterations)...
> Trying to crash (16 iterations)...
> Trying to crash (32 iterations)...
> Trying to crash (64 iterations)...
> Trying to crash (128 iterations)...
> Trying to crash (256 iterations)...
> Trying to crash (512 iterations)...
> Trying to crash (1024 iterations)...
> beam/erl_nif.c:445:cache_env() Assertion failed: env->hp_end == (env->proc)->stop


On 7 August 2018 at 09:38, Guilherme Andrade <[hidden email]> wrote:
Hi,

On 20 June 2018 at 18:39, Sverker Eriksson <[hidden email]> wrote:
I suggest you (and everybody during NIF development) to run on a debug-VM.
It's slower but it catches a lot of faults earlier and produces a nicer core
dump. It will catch enif_* API violations such as building tuples/lists/maps
with terms from different environments.

I finally found the time to look into this matter again. 

Using a debug build of the VM (20.3), the crash consistently happens due to a failed assertion in line 445 of erl_nif.c[1].
Other conditions are the same as before.

Is this enlightening in anyway? I've got a 2.8 GiB core dump I can look into if you tell me that's the right direction to go.



--
Guilherme



--
Guilherme

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Guilherme Andrade
On 7 August 2018 at 11:23, Guilherme Andrade <[hidden email]> wrote:
I've distilled the flow that triggers the crash and made a demo.

    https://github.com/g-andrade/enif_send_crash

It implements a dirty NIF which sends messages to a second process using enif_send() and a message containing a heap-allocated term.
It increases the number of iterations exponentially until it crashes.

I've also created an issue[1] in the bug tracker after someone privately suggested it to me.

[1]: https://bugs.erlang.org/browse/ERL-695

 

This is what happens in my machine very consistently (macOS):

> ./enif_send_crash
> Trying to crash (1 iterations)...
> Trying to crash (2 iterations)...
> Trying to crash (4 iterations)...
> Trying to crash (8 iterations)...
> Trying to crash (16 iterations)...
> Trying to crash (32 iterations)...
> Trying to crash (64 iterations)...
> Trying to crash (128 iterations)...
> Trying to crash (256 iterations)...
> Trying to crash (512 iterations)...
> Trying to crash (1024 iterations)...
> beam/erl_nif.c:445:cache_env() Assertion failed: env->hp_end == (env->proc)->stop


On 7 August 2018 at 09:38, Guilherme Andrade <[hidden email]> wrote:
Hi,

On 20 June 2018 at 18:39, Sverker Eriksson <[hidden email]> wrote:
I suggest you (and everybody during NIF development) to run on a debug-VM.
It's slower but it catches a lot of faults earlier and produces a nicer core
dump. It will catch enif_* API violations such as building tuples/lists/maps
with terms from different environments.

I finally found the time to look into this matter again. 

Using a debug build of the VM (20.3), the crash consistently happens due to a failed assertion in line 445 of erl_nif.c[1].
Other conditions are the same as before.

Is this enlightening in anyway? I've got a 2.8 GiB core dump I can look into if you tell me that's the right direction to go.



--
Guilherme



--
Guilherme



--
Guilherme

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions