enif_send() and overrun heap

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

enif_send() and overrun heap

Guilherme Andrade
Hello list,

I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling enif_send() a few dozen times a second from a dirty scheduler (CPU bound), with msg_env=NULL, results in heap overrun.

These is the flow that sooner or later results in heap overrun:

1) Single Erlang process makes a specific NIF call ~25 times per second (dirty scheduler, CPU bound)
2) This call will receive a few network packets (non-blocking)
3) Each of these packets gets wrapped in a tuple (allocated in process env)
4) For each of these wrapped packets, a lookup is made in a map, passed as a NIF argument, for a process dedicated to handling this particular packet
5.a) If the lookup succeeds, enif_send() is called to dispatch the wrapped packet to said dedicated process (with msg_env=NULL) - this is what happens to most packets
5.b) If the lookup fails, the wrapped packet is accumulated and later returned to the NIF caller

Now, when total packets per second increase to a few dozen, sooner or later (sometimes as soon as after ~10 seconds) the VM stops abruptly with this error message:

> hend=0x0000000013655fa0
> stop=0x0000000013655ef8
> htop=0x00000000136562c8
> heap=0x0000000013652db0
> beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap

(The pid mentioned above corresponds to the NIF caller.)

I tried three things (independently) that prevent the overrun from happening under this load:
A) Increasing the NIF caller heap size from the default (233 words) to 23300 words
B) Not running the NIF under a dirty scheduler
C) Not calling enif_send

Any ideas on why the overrun happens? Am I missing some very obvious transgression in the way enif_send() or dirty schedulers are supposed to be used?


--
Guilherme

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Daniel Goertzen-3
I wonder if running under a dirty scheduler counts as "running from a created thread" in which case the first parameter to enif_send() should be NULL.

On Wed, 20 Jun 2018 at 08:30 Guilherme Andrade <[hidden email]> wrote:
Hello list,

I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling enif_send() a few dozen times a second from a dirty scheduler (CPU bound), with msg_env=NULL, results in heap overrun.

These is the flow that sooner or later results in heap overrun:

1) Single Erlang process makes a specific NIF call ~25 times per second (dirty scheduler, CPU bound)
2) This call will receive a few network packets (non-blocking)
3) Each of these packets gets wrapped in a tuple (allocated in process env)
4) For each of these wrapped packets, a lookup is made in a map, passed as a NIF argument, for a process dedicated to handling this particular packet
5.a) If the lookup succeeds, enif_send() is called to dispatch the wrapped packet to said dedicated process (with msg_env=NULL) - this is what happens to most packets
5.b) If the lookup fails, the wrapped packet is accumulated and later returned to the NIF caller

Now, when total packets per second increase to a few dozen, sooner or later (sometimes as soon as after ~10 seconds) the VM stops abruptly with this error message:

> hend=0x0000000013655fa0
> stop=0x0000000013655ef8
> htop=0x00000000136562c8
> heap=0x0000000013652db0
> beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap

(The pid mentioned above corresponds to the NIF caller.)

I tried three things (independently) that prevent the overrun from happening under this load:
A) Increasing the NIF caller heap size from the default (233 words) to 23300 words
B) Not running the NIF under a dirty scheduler
C) Not calling enif_send

Any ideas on why the overrun happens? Am I missing some very obvious transgression in the way enif_send() or dirty schedulers are supposed to be used?


--
Guilherme
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Guilherme Andrade

On Wed, 20 Jun 2018, 14:42 Daniel Goertzen, <[hidden email]> wrote:
I wonder if running under a dirty scheduler counts as "running from a created thread" in which case the first parameter to enif_send() should be NULL.

Good point! That seemed to do it, thanks.

May someone from OTP  team confirm dirty schedulers count as created threads for enif_send? (It makes perfect sense they would, but it's best to check.)

If confirmed, maybe I'll do a PR to make documentation more explicit about this corner case.



On Wed, 20 Jun 2018 at 08:30 Guilherme Andrade <[hidden email]> wrote:
Hello list,

I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling enif_send() a few dozen times a second from a dirty scheduler (CPU bound), with msg_env=NULL, results in heap overrun.

These is the flow that sooner or later results in heap overrun:

1) Single Erlang process makes a specific NIF call ~25 times per second (dirty scheduler, CPU bound)
2) This call will receive a few network packets (non-blocking)
3) Each of these packets gets wrapped in a tuple (allocated in process env)
4) For each of these wrapped packets, a lookup is made in a map, passed as a NIF argument, for a process dedicated to handling this particular packet
5.a) If the lookup succeeds, enif_send() is called to dispatch the wrapped packet to said dedicated process (with msg_env=NULL) - this is what happens to most packets
5.b) If the lookup fails, the wrapped packet is accumulated and later returned to the NIF caller

Now, when total packets per second increase to a few dozen, sooner or later (sometimes as soon as after ~10 seconds) the VM stops abruptly with this error message:

> hend=0x0000000013655fa0
> stop=0x0000000013655ef8
> htop=0x00000000136562c8
> heap=0x0000000013652db0
> beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap

(The pid mentioned above corresponds to the NIF caller.)

I tried three things (independently) that prevent the overrun from happening under this load:
A) Increasing the NIF caller heap size from the default (233 words) to 23300 words
B) Not running the NIF under a dirty scheduler
C) Not calling enif_send

Any ideas on why the overrun happens? Am I missing some very obvious transgression in the way enif_send() or dirty schedulers are supposed to be used?


--
Guilherme
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Daniel Goertzen-3
Hmm, I think there's more to it.  The source shows that dirty schedulers *are* specifically handled in enif_send(), so something else is up.



On Wed, 20 Jun 2018 at 09:18 Guilherme Andrade <[hidden email]> wrote:

On Wed, 20 Jun 2018, 14:42 Daniel Goertzen, <[hidden email]> wrote:
I wonder if running under a dirty scheduler counts as "running from a created thread" in which case the first parameter to enif_send() should be NULL.

Good point! That seemed to do it, thanks.

May someone from OTP  team confirm dirty schedulers count as created threads for enif_send? (It makes perfect sense they would, but it's best to check.)

If confirmed, maybe I'll do a PR to make documentation more explicit about this corner case.



On Wed, 20 Jun 2018 at 08:30 Guilherme Andrade <[hidden email]> wrote:
Hello list,

I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling enif_send() a few dozen times a second from a dirty scheduler (CPU bound), with msg_env=NULL, results in heap overrun.

These is the flow that sooner or later results in heap overrun:

1) Single Erlang process makes a specific NIF call ~25 times per second (dirty scheduler, CPU bound)
2) This call will receive a few network packets (non-blocking)
3) Each of these packets gets wrapped in a tuple (allocated in process env)
4) For each of these wrapped packets, a lookup is made in a map, passed as a NIF argument, for a process dedicated to handling this particular packet
5.a) If the lookup succeeds, enif_send() is called to dispatch the wrapped packet to said dedicated process (with msg_env=NULL) - this is what happens to most packets
5.b) If the lookup fails, the wrapped packet is accumulated and later returned to the NIF caller

Now, when total packets per second increase to a few dozen, sooner or later (sometimes as soon as after ~10 seconds) the VM stops abruptly with this error message:

> hend=0x0000000013655fa0
> stop=0x0000000013655ef8
> htop=0x00000000136562c8
> heap=0x0000000013652db0
> beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap

(The pid mentioned above corresponds to the NIF caller.)

I tried three things (independently) that prevent the overrun from happening under this load:
A) Increasing the NIF caller heap size from the default (233 words) to 23300 words
B) Not running the NIF under a dirty scheduler
C) Not calling enif_send

Any ideas on why the overrun happens? Am I missing some very obvious transgression in the way enif_send() or dirty schedulers are supposed to be used?


--
Guilherme
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: enif_send() and overrun heap

Sverker Eriksson-4
Yes, a "created thread" refers to a thread that was not created by the VM
itself.

I suggest you (and everybody during NIF development) to run on a debug-VM.
It's slower but it catches a lot of faults earlier and produces a nicer core
dump. It will catch enif_* API violations such as building tuples/lists/maps
with terms from different environments.

# Build in source tree with
cd $ERL_TOP/erts/emulator
make debug

# Run in source tree with
$ERL_TOP/bin/cerl -debug      # yes, cerl with a 'c'


or if OTP-20 or newer
copy $ERL_TOP/bin/<target>/beam.debug.smp and erl_child_setup.debug
to same install dir as beam.smp and run with

erl -emu_type debug


/Sverker, Erlang/OTP Ericsson


On ons, 2018-06-20 at 10:17 -0500, Daniel Goertzen wrote:

> Hmm, I think there's more to it.  The source shows that dirty schedulers *are*
> specifically handled in enif_send(), so something else is up.
>
>
>
> On Wed, 20 Jun 2018 at 09:18 Guilherme Andrade <[hidden email]> wrote:
> >
> > On Wed, 20 Jun 2018, 14:42 Daniel Goertzen, <[hidden email]>
> > wrote:
> > > I wonder if running under a dirty scheduler counts as "running from a
> > > created thread" in which case the first parameter to enif_send() should be
> > > NULL.
> > >
> > Good point! That seemed to do it, thanks.
> >
> > May someone from OTP  team confirm dirty schedulers count as created threads
> > for enif_send? (It makes perfect sense they would, but it's best to check.)
> >
> > If confirmed, maybe I'll do a PR to make documentation more explicit about
> > this corner case.
> >
> >
> > >
> > > On Wed, 20 Jun 2018 at 08:30 Guilherme Andrade <[hidden email]> wrote:
> > > > Hello list,
> > > >
> > > > I'm bumping into a weird issue in OTP 20.3 (macOS) whereby calling
> > > > enif_send() a few dozen times a second from a dirty scheduler (CPU
> > > > bound), with msg_env=NULL, results in heap overrun.
> > > >
> > > > These is the flow that sooner or later results in heap overrun:
> > > >
> > > > 1) Single Erlang process makes a specific NIF call ~25 times per second
> > > > (dirty scheduler, CPU bound)
> > > > 2) This call will receive a few network packets (non-blocking)
> > > > 3) Each of these packets gets wrapped in a tuple (allocated in process
> > > > env)
> > > > 4) For each of these wrapped packets, a lookup is made in a map, passed
> > > > as a NIF argument, for a process dedicated to handling this particular
> > > > packet
> > > > 5.a) If the lookup succeeds, enif_send() is called to dispatch the
> > > > wrapped packet to said dedicated process (with msg_env=NULL) - this is
> > > > what happens to most packets
> > > > 5.b) If the lookup fails, the wrapped packet is accumulated and later
> > > > returned to the NIF caller
> > > >
> > > > Now, when total packets per second increase to a few dozen, sooner or
> > > > later (sometimes as soon as after ~10 seconds) the VM stops abruptly
> > > > with this error message:
> > > >
> > > > > hend=0x0000000013655fa0
> > > > > stop=0x0000000013655ef8
> > > > > htop=0x00000000136562c8
> > > > > heap=0x0000000013652db0
> > > > > beam/erl_gc.c, line 708: <0.506.0>: Overrun stack and heap
> > > >
> > > > (The pid mentioned above corresponds to the NIF caller.)
> > > >
> > > > I tried three things (independently) that prevent the overrun from
> > > > happening under this load:
> > > > A) Increasing the NIF caller heap size from the default (233 words) to
> > > > 23300 words
> > > > B) Not running the NIF under a dirty scheduler
> > > > C) Not calling enif_send
> > > >
> > > > Any ideas on why the overrun happens? Am I missing some very obvious
> > > > transgression in the way enif_send() or dirty schedulers are supposed to
> > > > be used?
> > > >
> > > >
> > > > _______________________________________________
> > > > erlang-questions mailing list
> > > > [hidden email]
> > > > http://erlang.org/mailman/listinfo/erlang-questions
> > > >
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions