How to track down intermittent segfaults in a threaded NIF

classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

How to track down intermittent segfaults in a threaded NIF

Igor Clark
Hi folks, hope all well,

I have a NIF which very occasionally segfaults, intermittently and
apparently unpredictably, bringing down the VM. I've spent a bunch of
time tracing allocation and dereferencing problems in my NIF code, and
I've got rid of what seems like 99%+ of the problems - but it still
occasionally happens, and I'm having trouble tracing further, because
the crash logs show the crashed threads as doing things like these:
(each one taken from a separate log where it's the only crashed thread)


> Thread 40 Crashed:: 8_scheduler
> 0   beam.smp                          0x000000001c19980b process_main
> + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0   beam.smp                          0x000000001c01d80b process_main
> + 1570
>
> Thread 7 Crashed:: 5_scheduler
> 0   beam.smp                          0x000000001baff0b8
> lists_member_2 + 63
>
> Thread 3 Crashed:: 1_scheduler
> 0   beam.smp                          0x000000001d4b780b process_main
> + 1570
>
> Thread 5 Crashed:: 3_scheduler
> 0   beam.smp                          0x000000001fcf280b process_main
> + 1570
>
> Thread 6 Crashed:: 4_scheduler
> 0   beam.smp                          0x000000001ae290b8
> lists_member_2 + 63


I'm very confident that the problems are in my code, not in the
scheduler ;-) But without more detail, I don't know how to trace where
they're happening. When they do, there are sometimes other threads doing
things in my code (maybe 20% of the time) - but mostly not, and on the
occasions when they are, I've been unable to see what the problem might
be on the lines referenced.

It seems like it's some kind of cross-thread data access issue, but I
don't know how to track it down.

Some more context about what's going on. My NIF load() function starts a
thread which passes a callback function to a library that talks to some
hardware, which calls the callback when it has a message. It's a
separate thread because the library only calls back to the thread that
initialized it; when I ran it directly in NIF load(), it didn't call
back, but in the VM-managed thread, it works as expected. The thread
sits and waits for stuff to happen, and callbacks come when they should.

I use enif_thread_create/enif_thread_opts_create to start the thread,
and use enif_alloc/enif_free everywhere. I keep a static pointer in the
NIF to a couple of members of the state struct, as that seems the only
way to reference them in the callback function. The struct is kept in
NIF private data: I pass **priv from load() to the thread_main function,
allocate the state struct using enif_alloc in thread_main, and set priv
pointing to the state struct, also in the thread. Other NIF functions do
access members of the state struct, but only ever through
enif_priv_data( env ).

The vast majority of the time it all works perfectly, humming along very
nicely, but every now and then, without any real pattern I can see, it
just segfaults and the VM comes down. It's only happened 3 times in the
last 20+ hours of working on the app, testing & running all the while,
doing VM starts, stops, code reloads, etc. But when it happens, it's
kind of a showstopper, and I'd really like to nail it down.

This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM
version 9.0.0 (clang-900.0.38).

Any ideas on how/where to look next to try to track this down? Hope it's
not something structural in the above which just won't work.

Cheers,
Igor


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Sergej Jurečko
On macOS there is a quick way to get a stack trace if you compiled with debug symbols.
Open /Applications/Utilities/Console
Go to: User Reports

You will see beam.smp in there if it crashed. Click on it and you get a report what every thread was calling at the time of crash.


Regards,
Sergej

> On 28 May 2018, at 23:46, Igor Clark <[hidden email]> wrote:
>
> Hi folks, hope all well,
>
> I have a NIF which very occasionally segfaults, intermittently and apparently unpredictably, bringing down the VM. I've spent a bunch of time tracing allocation and dereferencing problems in my NIF code, and I've got rid of what seems like 99%+ of the problems - but it still occasionally happens, and I'm having trouble tracing further, because the crash logs show the crashed threads as doing things like these: (each one taken from a separate log where it's the only crashed thread)
>
>
>> Thread 40 Crashed:: 8_scheduler
>> 0   beam.smp                          0x000000001c19980b process_main + 1570
>>
>> Thread 5 Crashed:: 3_scheduler
>> 0   beam.smp                          0x000000001c01d80b process_main + 1570
>>
>> Thread 7 Crashed:: 5_scheduler
>> 0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63
>>
>> Thread 3 Crashed:: 1_scheduler
>> 0   beam.smp                          0x000000001d4b780b process_main + 1570
>>
>> Thread 5 Crashed:: 3_scheduler
>> 0   beam.smp                          0x000000001fcf280b process_main + 1570
>>
>> Thread 6 Crashed:: 4_scheduler
>> 0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63
>
>
> I'm very confident that the problems are in my code, not in the scheduler ;-) But without more detail, I don't know how to trace where they're happening. When they do, there are sometimes other threads doing things in my code (maybe 20% of the time) - but mostly not, and on the occasions when they are, I've been unable to see what the problem might be on the lines referenced.
>
> It seems like it's some kind of cross-thread data access issue, but I don't know how to track it down.
>
> Some more context about what's going on. My NIF load() function starts a thread which passes a callback function to a library that talks to some hardware, which calls the callback when it has a message. It's a separate thread because the library only calls back to the thread that initialized it; when I ran it directly in NIF load(), it didn't call back, but in the VM-managed thread, it works as expected. The thread sits and waits for stuff to happen, and callbacks come when they should.
>
> I use enif_thread_create/enif_thread_opts_create to start the thread, and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF to a couple of members of the state struct, as that seems the only way to reference them in the callback function. The struct is kept in NIF private data: I pass **priv from load() to the thread_main function, allocate the state struct using enif_alloc in thread_main, and set priv pointing to the state struct, also in the thread. Other NIF functions do access members of the state struct, but only ever through enif_priv_data( env ).
>
> The vast majority of the time it all works perfectly, humming along very nicely, but every now and then, without any real pattern I can see, it just segfaults and the VM comes down. It's only happened 3 times in the last 20+ hours of working on the app, testing & running all the while, doing VM starts, stops, code reloads, etc. But when it happens, it's kind of a showstopper, and I'd really like to nail it down.
>
> This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).
>
> Any ideas on how/where to look next to try to track this down? Hope it's not something structural in the above which just won't work.
>
> Cheers,
> Igor
>
>
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Igor Clark
Thanks Sergej - that's where I got the thread reports I pasted in below,
from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.

Each log says the only crashed thread was a scheduler thread, for
example "8_scheduler" running "process_main" in the case of the first
one below. This is how I tracked down a bunch of errors in my own code,
but the only ones that still happen are in the scheduler, according to
the Console crash logs.

The thing is, it seems really unlikely that a VM running my NIF code
would just happen to be crashing in the scheduler rather than my code(!)
- so that's what I'm trying to work out, how to find out what's actually
going on, given that the log tells me the crashed thread is running
"process_main" or 'lists_member_2'.

Any suggestions welcome!

Cheers,
Igor

On 29/05/2018 04:16, Sergej Jurečko wrote:

> On macOS there is a quick way to get a stack trace if you compiled with debug symbols.
> Open /Applications/Utilities/Console
> Go to: User Reports
>
> You will see beam.smp in there if it crashed. Click on it and you get a report what every thread was calling at the time of crash.
>
>
> Regards,
> Sergej
>
>> On 28 May 2018, at 23:46, Igor Clark <[hidden email]> wrote:
>>
>> Hi folks, hope all well,
>>
>> I have a NIF which very occasionally segfaults, intermittently and apparently unpredictably, bringing down the VM. I've spent a bunch of time tracing allocation and dereferencing problems in my NIF code, and I've got rid of what seems like 99%+ of the problems - but it still occasionally happens, and I'm having trouble tracing further, because the crash logs show the crashed threads as doing things like these: (each one taken from a separate log where it's the only crashed thread)
>>
>>
>>> Thread 40 Crashed:: 8_scheduler
>>> 0   beam.smp                          0x000000001c19980b process_main + 1570
>>>
>>> Thread 5 Crashed:: 3_scheduler
>>> 0   beam.smp                          0x000000001c01d80b process_main + 1570
>>>
>>> Thread 7 Crashed:: 5_scheduler
>>> 0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63
>>>
>>> Thread 3 Crashed:: 1_scheduler
>>> 0   beam.smp                          0x000000001d4b780b process_main + 1570
>>>
>>> Thread 5 Crashed:: 3_scheduler
>>> 0   beam.smp                          0x000000001fcf280b process_main + 1570
>>>
>>> Thread 6 Crashed:: 4_scheduler
>>> 0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63
>>
>> I'm very confident that the problems are in my code, not in the scheduler ;-) But without more detail, I don't know how to trace where they're happening. When they do, there are sometimes other threads doing things in my code (maybe 20% of the time) - but mostly not, and on the occasions when they are, I've been unable to see what the problem might be on the lines referenced.
>>
>> It seems like it's some kind of cross-thread data access issue, but I don't know how to track it down.
>>
>> Some more context about what's going on. My NIF load() function starts a thread which passes a callback function to a library that talks to some hardware, which calls the callback when it has a message. It's a separate thread because the library only calls back to the thread that initialized it; when I ran it directly in NIF load(), it didn't call back, but in the VM-managed thread, it works as expected. The thread sits and waits for stuff to happen, and callbacks come when they should.
>>
>> I use enif_thread_create/enif_thread_opts_create to start the thread, and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF to a couple of members of the state struct, as that seems the only way to reference them in the callback function. The struct is kept in NIF private data: I pass **priv from load() to the thread_main function, allocate the state struct using enif_alloc in thread_main, and set priv pointing to the state struct, also in the thread. Other NIF functions do access members of the state struct, but only ever through enif_priv_data( env ).
>>
>> The vast majority of the time it all works perfectly, humming along very nicely, but every now and then, without any real pattern I can see, it just segfaults and the VM comes down. It's only happened 3 times in the last 20+ hours of working on the app, testing & running all the while, doing VM starts, stops, code reloads, etc. But when it happens, it's kind of a showstopper, and I'd really like to nail it down.
>>
>> This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).
>>
>> Any ideas on how/where to look next to try to track this down? Hope it's not something structural in the above which just won't work.
>>
>> Cheers,
>> Igor
>>
>>
>> _______________________________________________
>> erlang-questions mailing list
>> [hidden email]
>> http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Lukas Larsson-8
Have you tried to run your code in a debug emulator? https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system

Since it seems to be segfaulting in lists:member/2, I would guess that your nif somehow builds an invalid list that later is used by lists:member/2.

On Tue, May 29, 2018 at 11:04 AM, Igor Clark <[hidden email]> wrote:
Thanks Sergej - that's where I got the thread reports I pasted in below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.

Each log says the only crashed thread was a scheduler thread, for example "8_scheduler" running "process_main" in the case of the first one below. This is how I tracked down a bunch of errors in my own code, but the only ones that still happen are in the scheduler, according to the Console crash logs.

The thing is, it seems really unlikely that a VM running my NIF code would just happen to be crashing in the scheduler rather than my code(!) - so that's what I'm trying to work out, how to find out what's actually going on, given that the log tells me the crashed thread is running "process_main" or 'lists_member_2'.

Any suggestions welcome!

Cheers,
Igor


On 29/05/2018 04:16, Sergej Jurečko wrote:
On macOS there is a quick way to get a stack trace if you compiled with debug symbols.
Open /Applications/Utilities/Console
Go to: User Reports

You will see beam.smp in there if it crashed. Click on it and you get a report what every thread was calling at the time of crash.


Regards,
Sergej

On 28 May 2018, at 23:46, Igor Clark <[hidden email]> wrote:

Hi folks, hope all well,

I have a NIF which very occasionally segfaults, intermittently and apparently unpredictably, bringing down the VM. I've spent a bunch of time tracing allocation and dereferencing problems in my NIF code, and I've got rid of what seems like 99%+ of the problems - but it still occasionally happens, and I'm having trouble tracing further, because the crash logs show the crashed threads as doing things like these: (each one taken from a separate log where it's the only crashed thread)


Thread 40 Crashed:: 8_scheduler
0   beam.smp                          0x000000001c19980b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001c01d80b process_main + 1570

Thread 7 Crashed:: 5_scheduler
0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63

Thread 3 Crashed:: 1_scheduler
0   beam.smp                          0x000000001d4b780b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001fcf280b process_main + 1570

Thread 6 Crashed:: 4_scheduler
0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63

I'm very confident that the problems are in my code, not in the scheduler ;-) But without more detail, I don't know how to trace where they're happening. When they do, there are sometimes other threads doing things in my code (maybe 20% of the time) - but mostly not, and on the occasions when they are, I've been unable to see what the problem might be on the lines referenced.

It seems like it's some kind of cross-thread data access issue, but I don't know how to track it down.

Some more context about what's going on. My NIF load() function starts a thread which passes a callback function to a library that talks to some hardware, which calls the callback when it has a message. It's a separate thread because the library only calls back to the thread that initialized it; when I ran it directly in NIF load(), it didn't call back, but in the VM-managed thread, it works as expected. The thread sits and waits for stuff to happen, and callbacks come when they should.

I use enif_thread_create/enif_thread_opts_create to start the thread, and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF to a couple of members of the state struct, as that seems the only way to reference them in the callback function. The struct is kept in NIF private data: I pass **priv from load() to the thread_main function, allocate the state struct using enif_alloc in thread_main, and set priv pointing to the state struct, also in the thread. Other NIF functions do access members of the state struct, but only ever through enif_priv_data( env ).

The vast majority of the time it all works perfectly, humming along very nicely, but every now and then, without any real pattern I can see, it just segfaults and the VM comes down. It's only happened 3 times in the last 20+ hours of working on the app, testing & running all the while, doing VM starts, stops, code reloads, etc. But when it happens, it's kind of a showstopper, and I'd really like to nail it down.

This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).

Any ideas on how/where to look next to try to track this down? Hope it's not something structural in the above which just won't work.

Cheers,
Igor


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Igor Clark
Thanks very much Lukas, I think the debug emulator could be what I'm looking for. The NIF only sometimes crashes on lists:member/2 - those log lines are all from different crashes (there's only one crashed thread each time), and sometimes it just crashes on process_main. So I think I might need the debug emulator to trace further.

However I have a lot to learn about how to integrate C tooling with something so complex. When I run the debug emulator, does it just show more detailed info in stack traces, or will I need to attach gdb/lldb etc to find out what's going on? Is there any more info on how to set this all up?

Also, not 100% sure how to run it, as I run my app with "rebar3 shell" from a release layout during development, or the same inside the NIF-specific app when trying to track problems down there. The doc you linked says:

To start the debug enabled runtime system execute:

$ $ERL_TOP/bin/cerl -debug

I realise these are more rebar3 than erlang questions, but I can't find much in the rebar3 docs about them:

- How should I specify that rebar3 should run "cerl" instead of "erl" ?

- Should I just add "-debug" in my "config/vm.args" or is there another way to do this?

Thank you for your help!
i

On 29/05/2018 11:30, Lukas Larsson wrote:
Have you tried to run your code in a debug emulator? https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system

Since it seems to be segfaulting in lists:member/2, I would guess that your nif somehow builds an invalid list that later is used by lists:member/2.

On Tue, May 29, 2018 at 11:04 AM, Igor Clark <[hidden email]> wrote:
Thanks Sergej - that's where I got the thread reports I pasted in below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.

Each log says the only crashed thread was a scheduler thread, for example "8_scheduler" running "process_main" in the case of the first one below. This is how I tracked down a bunch of errors in my own code, but the only ones that still happen are in the scheduler, according to the Console crash logs.

The thing is, it seems really unlikely that a VM running my NIF code would just happen to be crashing in the scheduler rather than my code(!) - so that's what I'm trying to work out, how to find out what's actually going on, given that the log tells me the crashed thread is running "process_main" or 'lists_member_2'.

Any suggestions welcome!

Cheers,
Igor


On 29/05/2018 04:16, Sergej Jurečko wrote:
On macOS there is a quick way to get a stack trace if you compiled with debug symbols.
Open /Applications/Utilities/Console
Go to: User Reports

You will see beam.smp in there if it crashed. Click on it and you get a report what every thread was calling at the time of crash.


Regards,
Sergej

On 28 May 2018, at 23:46, Igor Clark <[hidden email]> wrote:

Hi folks, hope all well,

I have a NIF which very occasionally segfaults, intermittently and apparently unpredictably, bringing down the VM. I've spent a bunch of time tracing allocation and dereferencing problems in my NIF code, and I've got rid of what seems like 99%+ of the problems - but it still occasionally happens, and I'm having trouble tracing further, because the crash logs show the crashed threads as doing things like these: (each one taken from a separate log where it's the only crashed thread)


Thread 40 Crashed:: 8_scheduler
0   beam.smp                          0x000000001c19980b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001c01d80b process_main + 1570

Thread 7 Crashed:: 5_scheduler
0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63

Thread 3 Crashed:: 1_scheduler
0   beam.smp                          0x000000001d4b780b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001fcf280b process_main + 1570

Thread 6 Crashed:: 4_scheduler
0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63

I'm very confident that the problems are in my code, not in the scheduler ;-) But without more detail, I don't know how to trace where they're happening. When they do, there are sometimes other threads doing things in my code (maybe 20% of the time) - but mostly not, and on the occasions when they are, I've been unable to see what the problem might be on the lines referenced.

It seems like it's some kind of cross-thread data access issue, but I don't know how to track it down.

Some more context about what's going on. My NIF load() function starts a thread which passes a callback function to a library that talks to some hardware, which calls the callback when it has a message. It's a separate thread because the library only calls back to the thread that initialized it; when I ran it directly in NIF load(), it didn't call back, but in the VM-managed thread, it works as expected. The thread sits and waits for stuff to happen, and callbacks come when they should.

I use enif_thread_create/enif_thread_opts_create to start the thread, and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF to a couple of members of the state struct, as that seems the only way to reference them in the callback function. The struct is kept in NIF private data: I pass **priv from load() to the thread_main function, allocate the state struct using enif_alloc in thread_main, and set priv pointing to the state struct, also in the thread. Other NIF functions do access members of the state struct, but only ever through enif_priv_data( env ).

The vast majority of the time it all works perfectly, humming along very nicely, but every now and then, without any real pattern I can see, it just segfaults and the VM comes down. It's only happened 3 times in the last 20+ hours of working on the app, testing & running all the while, doing VM starts, stops, code reloads, etc. But when it happens, it's kind of a showstopper, and I'd really like to nail it down.

This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).

Any ideas on how/where to look next to try to track this down? Hope it's not something structural in the above which just won't work.

Cheers,
Igor


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

scott ribe
In reply to this post by Igor Clark
What's most likely going on is that your NIF is corrupting memory that the scheduler uses. These kinds of corruptions are difficult to track down, because the crash can happen long after the bug.

You should be running with all memory debugging aids turned on--this can get you a crash closer to the actual bug, if you're lucky maybe even at the bug.

https://developer.apple.com/library/content/documentation/Performance/Conceptual/ManagingMemory/Articles/MallocDebug.html

https://developer.apple.com/library/content/technotes/tn2124/_index.html

--
Scott Ribe
[hidden email]
https://www.linkedin.com/in/scottribe/



> On May 29, 2018, at 3:04 AM, Igor Clark <[hidden email]> wrote:
>
> Thanks Sergej - that's where I got the thread reports I pasted in below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>
> Each log says the only crashed thread was a scheduler thread, for example "8_scheduler" running "process_main" in the case of the first one below. This is how I tracked down a bunch of errors in my own code, but the only ones that still happen are in the scheduler, according to the Console crash logs.
>
> The thing is, it seems really unlikely that a VM running my NIF code would just happen to be crashing in the scheduler rather than my code(!) - so that's what I'm trying to work out, how to find out what's actually going on, given that the log tells me the crashed thread is running "process_main" or 'lists_member_2'.
>
> Any suggestions welcome!
>
> Cheers,
> Igor
>
> On 29/05/2018 04:16, Sergej Jurečko wrote:

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Lukas Larsson-8
In reply to this post by Igor Clark
I don't know how to make rebar3 run the debug emulator, but a quick and dirty trick that I do when all else fails is to copy the beam.debug.smp file over the beam.smp file.

You probably also have to copy the erl_child_setup.debug file, that file should however have the .debug suffix remaining. So:

cp bin/`erts/autoconf/config.guess`/beam.debug.smp path/to/release/erts-v.s.n/bin/beam.smp
cp bin/`erts/autoconf/config.guess`/erl_child_setup.debug path/to/release/erts-v.s.n/bin/

On Tue, May 29, 2018 at 1:30 PM, Igor Clark <[hidden email]> wrote:
Thanks very much Lukas, I think the debug emulator could be what I'm looking for. The NIF only sometimes crashes on lists:member/2 - those log lines are all from different crashes (there's only one crashed thread each time), and sometimes it just crashes on process_main. So I think I might need the debug emulator to trace further.

However I have a lot to learn about how to integrate C tooling with something so complex. When I run the debug emulator, does it just show more detailed info in stack traces, or will I need to attach gdb/lldb etc to find out what's going on? Is there any more info on how to set this all up?

Also, not 100% sure how to run it, as I run my app with "rebar3 shell" from a release layout during development, or the same inside the NIF-specific app when trying to track problems down there. The doc you linked says:

To start the debug enabled runtime system execute:

$ $ERL_TOP/bin/cerl -debug

I realise these are more rebar3 than erlang questions, but I can't find much in the rebar3 docs about them:

- How should I specify that rebar3 should run "cerl" instead of "erl" ?

- Should I just add "-debug" in my "config/vm.args" or is there another way to do this?

Thank you for your help!
i


On 29/05/2018 11:30, Lukas Larsson wrote:
Have you tried to run your code in a debug emulator? https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system

Since it seems to be segfaulting in lists:member/2, I would guess that your nif somehow builds an invalid list that later is used by lists:member/2.

On Tue, May 29, 2018 at 11:04 AM, Igor Clark <[hidden email]> wrote:
Thanks Sergej - that's where I got the thread reports I pasted in below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.

Each log says the only crashed thread was a scheduler thread, for example "8_scheduler" running "process_main" in the case of the first one below. This is how I tracked down a bunch of errors in my own code, but the only ones that still happen are in the scheduler, according to the Console crash logs.

The thing is, it seems really unlikely that a VM running my NIF code would just happen to be crashing in the scheduler rather than my code(!) - so that's what I'm trying to work out, how to find out what's actually going on, given that the log tells me the crashed thread is running "process_main" or 'lists_member_2'.

Any suggestions welcome!

Cheers,
Igor


On 29/05/2018 04:16, Sergej Jurečko wrote:
On macOS there is a quick way to get a stack trace if you compiled with debug symbols.
Open /Applications/Utilities/Console
Go to: User Reports

You will see beam.smp in there if it crashed. Click on it and you get a report what every thread was calling at the time of crash.


Regards,
Sergej

On 28 May 2018, at 23:46, Igor Clark <[hidden email]> wrote:

Hi folks, hope all well,

I have a NIF which very occasionally segfaults, intermittently and apparently unpredictably, bringing down the VM. I've spent a bunch of time tracing allocation and dereferencing problems in my NIF code, and I've got rid of what seems like 99%+ of the problems - but it still occasionally happens, and I'm having trouble tracing further, because the crash logs show the crashed threads as doing things like these: (each one taken from a separate log where it's the only crashed thread)


Thread 40 Crashed:: 8_scheduler
0   beam.smp                          0x000000001c19980b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001c01d80b process_main + 1570

Thread 7 Crashed:: 5_scheduler
0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63

Thread 3 Crashed:: 1_scheduler
0   beam.smp                          0x000000001d4b780b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001fcf280b process_main + 1570

Thread 6 Crashed:: 4_scheduler
0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63

I'm very confident that the problems are in my code, not in the scheduler ;-) But without more detail, I don't know how to trace where they're happening. When they do, there are sometimes other threads doing things in my code (maybe 20% of the time) - but mostly not, and on the occasions when they are, I've been unable to see what the problem might be on the lines referenced.

It seems like it's some kind of cross-thread data access issue, but I don't know how to track it down.

Some more context about what's going on. My NIF load() function starts a thread which passes a callback function to a library that talks to some hardware, which calls the callback when it has a message. It's a separate thread because the library only calls back to the thread that initialized it; when I ran it directly in NIF load(), it didn't call back, but in the VM-managed thread, it works as expected. The thread sits and waits for stuff to happen, and callbacks come when they should.

I use enif_thread_create/enif_thread_opts_create to start the thread, and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF to a couple of members of the state struct, as that seems the only way to reference them in the callback function. The struct is kept in NIF private data: I pass **priv from load() to the thread_main function, allocate the state struct using enif_alloc in thread_main, and set priv pointing to the state struct, also in the thread. Other NIF functions do access members of the state struct, but only ever through enif_priv_data( env ).

The vast majority of the time it all works perfectly, humming along very nicely, but every now and then, without any real pattern I can see, it just segfaults and the VM comes down. It's only happened 3 times in the last 20+ hours of working on the app, testing & running all the while, doing VM starts, stops, code reloads, etc. But when it happens, it's kind of a showstopper, and I'd really like to nail it down.

This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).

Any ideas on how/where to look next to try to track this down? Hope it's not something structural in the above which just won't work.

Cheers,
Igor


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Peti Gömöri-2
since OTP 20 the -emu_type flag might also work eg.:
  erl -emu_type debug

and you can put it in the vm.args file too

On Tue, May 29, 2018 at 2:45 PM, Lukas Larsson <[hidden email]> wrote:
I don't know how to make rebar3 run the debug emulator, but a quick and dirty trick that I do when all else fails is to copy the beam.debug.smp file over the beam.smp file.

You probably also have to copy the erl_child_setup.debug file, that file should however have the .debug suffix remaining. So:

cp bin/`erts/autoconf/config.guess`/beam.debug.smp path/to/release/erts-v.s.n/bin/beam.smp
cp bin/`erts/autoconf/config.guess`/erl_child_setup.debug path/to/release/erts-v.s.n/bin/


On Tue, May 29, 2018 at 1:30 PM, Igor Clark <[hidden email]> wrote:
Thanks very much Lukas, I think the debug emulator could be what I'm looking for. The NIF only sometimes crashes on lists:member/2 - those log lines are all from different crashes (there's only one crashed thread each time), and sometimes it just crashes on process_main. So I think I might need the debug emulator to trace further.

However I have a lot to learn about how to integrate C tooling with something so complex. When I run the debug emulator, does it just show more detailed info in stack traces, or will I need to attach gdb/lldb etc to find out what's going on? Is there any more info on how to set this all up?

Also, not 100% sure how to run it, as I run my app with "rebar3 shell" from a release layout during development, or the same inside the NIF-specific app when trying to track problems down there. The doc you linked says:

To start the debug enabled runtime system execute:

$ $ERL_TOP/bin/cerl -debug

I realise these are more rebar3 than erlang questions, but I can't find much in the rebar3 docs about them:

- How should I specify that rebar3 should run "cerl" instead of "erl" ?

- Should I just add "-debug" in my "config/vm.args" or is there another way to do this?

Thank you for your help!
i


On 29/05/2018 11:30, Lukas Larsson wrote:
Have you tried to run your code in a debug emulator? https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system

Since it seems to be segfaulting in lists:member/2, I would guess that your nif somehow builds an invalid list that later is used by lists:member/2.

On Tue, May 29, 2018 at 11:04 AM, Igor Clark <[hidden email]> wrote:
Thanks Sergej - that's where I got the thread reports I pasted in below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.

Each log says the only crashed thread was a scheduler thread, for example "8_scheduler" running "process_main" in the case of the first one below. This is how I tracked down a bunch of errors in my own code, but the only ones that still happen are in the scheduler, according to the Console crash logs.

The thing is, it seems really unlikely that a VM running my NIF code would just happen to be crashing in the scheduler rather than my code(!) - so that's what I'm trying to work out, how to find out what's actually going on, given that the log tells me the crashed thread is running "process_main" or 'lists_member_2'.

Any suggestions welcome!

Cheers,
Igor


On 29/05/2018 04:16, Sergej Jurečko wrote:
On macOS there is a quick way to get a stack trace if you compiled with debug symbols.
Open /Applications/Utilities/Console
Go to: User Reports

You will see beam.smp in there if it crashed. Click on it and you get a report what every thread was calling at the time of crash.


Regards,
Sergej

On 28 May 2018, at 23:46, Igor Clark <[hidden email]> wrote:

Hi folks, hope all well,

I have a NIF which very occasionally segfaults, intermittently and apparently unpredictably, bringing down the VM. I've spent a bunch of time tracing allocation and dereferencing problems in my NIF code, and I've got rid of what seems like 99%+ of the problems - but it still occasionally happens, and I'm having trouble tracing further, because the crash logs show the crashed threads as doing things like these: (each one taken from a separate log where it's the only crashed thread)


Thread 40 Crashed:: 8_scheduler
0   beam.smp                          0x000000001c19980b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001c01d80b process_main + 1570

Thread 7 Crashed:: 5_scheduler
0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63

Thread 3 Crashed:: 1_scheduler
0   beam.smp                          0x000000001d4b780b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001fcf280b process_main + 1570

Thread 6 Crashed:: 4_scheduler
0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63

I'm very confident that the problems are in my code, not in the scheduler ;-) But without more detail, I don't know how to trace where they're happening. When they do, there are sometimes other threads doing things in my code (maybe 20% of the time) - but mostly not, and on the occasions when they are, I've been unable to see what the problem might be on the lines referenced.

It seems like it's some kind of cross-thread data access issue, but I don't know how to track it down.

Some more context about what's going on. My NIF load() function starts a thread which passes a callback function to a library that talks to some hardware, which calls the callback when it has a message. It's a separate thread because the library only calls back to the thread that initialized it; when I ran it directly in NIF load(), it didn't call back, but in the VM-managed thread, it works as expected. The thread sits and waits for stuff to happen, and callbacks come when they should.

I use enif_thread_create/enif_thread_opts_create to start the thread, and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF to a couple of members of the state struct, as that seems the only way to reference them in the callback function. The struct is kept in NIF private data: I pass **priv from load() to the thread_main function, allocate the state struct using enif_alloc in thread_main, and set priv pointing to the state struct, also in the thread. Other NIF functions do access members of the state struct, but only ever through enif_priv_data( env ).

The vast majority of the time it all works perfectly, humming along very nicely, but every now and then, without any real pattern I can see, it just segfaults and the VM comes down. It's only happened 3 times in the last 20+ hours of working on the app, testing & running all the while, doing VM starts, stops, code reloads, etc. But when it happens, it's kind of a showstopper, and I'd really like to nail it down.

This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).

Any ideas on how/where to look next to try to track this down? Hope it's not something structural in the above which just won't work.

Cheers,
Igor


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Fred Hebert-2
In reply to this post by Igor Clark

On Tue, May 29, 2018 at 7:30 AM, Igor Clark <[hidden email]> wrote:

- How should I specify that rebar3 should run "cerl" instead of "erl" ?

You can't, especially if you're using 'rebar3 shell', where all we can do is boot a shell within an escript. Do try to debug from a release. You can probably do something like

cerl -boot _build/prod/rel/<relname>/release/<version>/<relname> -config _build/prod/rel/<relname>/release/<version>/sys

or if possible:

_build/prod/rel/<relname>/erts-<vsn>/bin/erl -boot _build/prod/rel/release/<version>/<relname> -config _build/prod/rel/release/<version>/sys -emu_type debug

These two commands will ignore the contents of your vm.args but they let you call the underlying mechanism to boot the VM with the arguments you need without the nodetool wrappers that relx ships.


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Igor Clark
In reply to this post by Peti Gömöri-2
Thanks Lukas and Peti, that's great. "erl -emu_type debug" definitely works - I haven't made the debug build yet but I get "erlexec: The emulator '/usr/local/Cellar/erlang/20.3.4/lib/erlang/erts-9.3/bin/beam.debug.smp' does not exist", which is what I want. I'll get onto the debug build and see what I can find out.

In case anyone else wants to use that in rebar3 shell, I found http://www.rebar3.org/v3.0/discuss/5745fb105528582000dfb47f which shows you can set ERL_FLAGS to just set -emu_type directly, or specify vm.args so you can then set it in there, e.g.:

ERL_FLAGS=" -args_file config/vm.args -config config/sys.config" rebar3 shell

Cheers,
Igor

On 29/05/2018 14:09, Peti Gömöri wrote:
since OTP 20 the -emu_type flag might also work eg.:
  erl -emu_type debug

and you can put it in the vm.args file too

On Tue, May 29, 2018 at 2:45 PM, Lukas Larsson <[hidden email]> wrote:
I don't know how to make rebar3 run the debug emulator, but a quick and dirty trick that I do when all else fails is to copy the beam.debug.smp file over the beam.smp file.

You probably also have to copy the erl_child_setup.debug file, that file should however have the .debug suffix remaining. So:

cp bin/`erts/autoconf/config.guess`/beam.debug.smp path/to/release/erts-v.s.n/bin/beam.smp
cp bin/`erts/autoconf/config.guess`/erl_child_setup.debug path/to/release/erts-v.s.n/bin/


On Tue, May 29, 2018 at 1:30 PM, Igor Clark <[hidden email]> wrote:
Thanks very much Lukas, I think the debug emulator could be what I'm looking for. The NIF only sometimes crashes on lists:member/2 - those log lines are all from different crashes (there's only one crashed thread each time), and sometimes it just crashes on process_main. So I think I might need the debug emulator to trace further.

However I have a lot to learn about how to integrate C tooling with something so complex. When I run the debug emulator, does it just show more detailed info in stack traces, or will I need to attach gdb/lldb etc to find out what's going on? Is there any more info on how to set this all up?

Also, not 100% sure how to run it, as I run my app with "rebar3 shell" from a release layout during development, or the same inside the NIF-specific app when trying to track problems down there. The doc you linked says:

To start the debug enabled runtime system execute:

$ $ERL_TOP/bin/cerl -debug

I realise these are more rebar3 than erlang questions, but I can't find much in the rebar3 docs about them:

- How should I specify that rebar3 should run "cerl" instead of "erl" ?

- Should I just add "-debug" in my "config/vm.args" or is there another way to do this?

Thank you for your help!
i


On 29/05/2018 11:30, Lukas Larsson wrote:
Have you tried to run your code in a debug emulator? https://github.com/erlang/otp/blob/master/HOWTO/INSTALL.md#how-to-build-a-debug-enabled-erlang-runtime-system

Since it seems to be segfaulting in lists:member/2, I would guess that your nif somehow builds an invalid list that later is used by lists:member/2.

On Tue, May 29, 2018 at 11:04 AM, Igor Clark <[hidden email]> wrote:
Thanks Sergej - that's where I got the thread reports I pasted in below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.

Each log says the only crashed thread was a scheduler thread, for example "8_scheduler" running "process_main" in the case of the first one below. This is how I tracked down a bunch of errors in my own code, but the only ones that still happen are in the scheduler, according to the Console crash logs.

The thing is, it seems really unlikely that a VM running my NIF code would just happen to be crashing in the scheduler rather than my code(!) - so that's what I'm trying to work out, how to find out what's actually going on, given that the log tells me the crashed thread is running "process_main" or 'lists_member_2'.

Any suggestions welcome!

Cheers,
Igor


On 29/05/2018 04:16, Sergej Jurečko wrote:
On macOS there is a quick way to get a stack trace if you compiled with debug symbols.
Open /Applications/Utilities/Console
Go to: User Reports

You will see beam.smp in there if it crashed. Click on it and you get a report what every thread was calling at the time of crash.


Regards,
Sergej

On 28 May 2018, at 23:46, Igor Clark <[hidden email]> wrote:

Hi folks, hope all well,

I have a NIF which very occasionally segfaults, intermittently and apparently unpredictably, bringing down the VM. I've spent a bunch of time tracing allocation and dereferencing problems in my NIF code, and I've got rid of what seems like 99%+ of the problems - but it still occasionally happens, and I'm having trouble tracing further, because the crash logs show the crashed threads as doing things like these: (each one taken from a separate log where it's the only crashed thread)


Thread 40 Crashed:: 8_scheduler
0   beam.smp                          0x000000001c19980b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001c01d80b process_main + 1570

Thread 7 Crashed:: 5_scheduler
0   beam.smp                          0x000000001baff0b8 lists_member_2 + 63

Thread 3 Crashed:: 1_scheduler
0   beam.smp                          0x000000001d4b780b process_main + 1570

Thread 5 Crashed:: 3_scheduler
0   beam.smp                          0x000000001fcf280b process_main + 1570

Thread 6 Crashed:: 4_scheduler
0   beam.smp                          0x000000001ae290b8 lists_member_2 + 63

I'm very confident that the problems are in my code, not in the scheduler ;-) But without more detail, I don't know how to trace where they're happening. When they do, there are sometimes other threads doing things in my code (maybe 20% of the time) - but mostly not, and on the occasions when they are, I've been unable to see what the problem might be on the lines referenced.

It seems like it's some kind of cross-thread data access issue, but I don't know how to track it down.

Some more context about what's going on. My NIF load() function starts a thread which passes a callback function to a library that talks to some hardware, which calls the callback when it has a message. It's a separate thread because the library only calls back to the thread that initialized it; when I ran it directly in NIF load(), it didn't call back, but in the VM-managed thread, it works as expected. The thread sits and waits for stuff to happen, and callbacks come when they should.

I use enif_thread_create/enif_thread_opts_create to start the thread, and use enif_alloc/enif_free everywhere. I keep a static pointer in the NIF to a couple of members of the state struct, as that seems the only way to reference them in the callback function. The struct is kept in NIF private data: I pass **priv from load() to the thread_main function, allocate the state struct using enif_alloc in thread_main, and set priv pointing to the state struct, also in the thread. Other NIF functions do access members of the state struct, but only ever through enif_priv_data( env ).

The vast majority of the time it all works perfectly, humming along very nicely, but every now and then, without any real pattern I can see, it just segfaults and the VM comes down. It's only happened 3 times in the last 20+ hours of working on the app, testing & running all the while, doing VM starts, stops, code reloads, etc. But when it happens, it's kind of a showstopper, and I'd really like to nail it down.

This is all happening in Erlang 20.3.4 on MacOS 10.12.6 / Apple LLVM version 9.0.0 (clang-900.0.38).

Any ideas on how/where to look next to try to track this down? Hope it's not something structural in the above which just won't work.

Cheers,
Igor


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions




_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Tristan Sloughter-4
In reply to this post by Fred Hebert-2
If you need cerl this is true. But if you just need `-emu_type debug` then if you do `rebar3 local install` and follow the instructions to update your $PATH you can do:

REBAR3_ERL_ARGS="-emu_type debug" rebar3 shell


On Tue, May 29, 2018, at 7:57 AM, Fred Hebert wrote:

On Tue, May 29, 2018 at 7:30 AM, Igor Clark <[hidden email]> wrote:

- How should I specify that rebar3 should run "cerl" instead of "erl" ?

You can't, especially if you're using 'rebar3 shell', where all we can do is boot a shell within an escript. Do try to debug from a release. You can probably do something like

cerl -boot _build/prod/rel/<relname>/release/<version>/<relname> -config _build/prod/rel/<relname>/release/<version>/sys

or if possible:

_build/prod/rel/<relname>/erts-<vsn>/bin/erl -boot _build/prod/rel/release/<version>/<relname> -config _build/prod/rel/release/<version>/sys -emu_type debug

These two commands will ignore the contents of your vm.args but they let you call the underlying mechanism to boot the VM with the arguments you need without the nodetool wrappers that relx ships.
_______________________________________________
erlang-questions mailing list


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Igor Clark
In reply to this post by scott ribe
Thanks Scott, that sounds very much like what's going on and using
libgmalloc sounds like it's just what I want.

 From what I can tell from its docs, I could either do that in Xcode, or
using gdb, right? I have no clue how I'd go about running the BEAM
inside Xcode, especially booting my application and NIF code, and I've
got a lot to learn about how to connect any of this with rebar3 and
releases. Do you run BEAM in Xcode or directly with gdb/lldb?

How do the people debug the emulator versions with applications loaded?


On 29/05/2018 13:20, Scott Ribe wrote:

> What's most likely going on is that your NIF is corrupting memory that the scheduler uses. These kinds of corruptions are difficult to track down, because the crash can happen long after the bug.
>
> You should be running with all memory debugging aids turned on--this can get you a crash closer to the actual bug, if you're lucky maybe even at the bug.
>
> https://developer.apple.com/library/content/documentation/Performance/Conceptual/ManagingMemory/Articles/MallocDebug.html
>
> https://developer.apple.com/library/content/technotes/tn2124/_index.html
>
> --
> Scott Ribe
> [hidden email]
> https://www.linkedin.com/in/scottribe/
>
>
>
>> On May 29, 2018, at 3:04 AM, Igor Clark <[hidden email]> wrote:
>>
>> Thanks Sergej - that's where I got the thread reports I pasted in below, from e.g. 'beam.smp_2018-05-28-212735_Igor-Clarks-iMac.crash'.
>>
>> Each log says the only crashed thread was a scheduler thread, for example "8_scheduler" running "process_main" in the case of the first one below. This is how I tracked down a bunch of errors in my own code, but the only ones that still happen are in the scheduler, according to the Console crash logs.
>>
>> The thing is, it seems really unlikely that a VM running my NIF code would just happen to be crashing in the scheduler rather than my code(!) - so that's what I'm trying to work out, how to find out what's actually going on, given that the log tells me the crashed thread is running "process_main" or 'lists_member_2'.
>>
>> Any suggestions welcome!
>>
>> Cheers,
>> Igor
>>
>> On 29/05/2018 04:16, Sergej Jurečko wrote:

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

scott ribe
> On May 29, 2018, at 8:33 AM, Igor Clark <[hidden email]> wrote:
>
> From what I can tell from its docs, I could either do that in Xcode, or using gdb, right? I have no clue how I'd go about running the BEAM inside Xcode, especially booting my application and NIF code, and I've got a lot to learn about how to connect any of this with rebar3 and releases. Do you run BEAM in Xcode or directly with gdb/lldb?

Actually, all you have to do is export those environment variables in whatever environment you launch, not just Xcode or gdb.

At launch, they barf some descriptive info to stdout (or stderr?), so if you can figure out where the beam sends that and monitor it, you should see confirmation that they're in effect:


SHR-MacBook-Pro:~ sribe$ export MallocStackLogging=1
SHR-MacBook-Pro:~ sribe$ ls
ls(32181,0x7fff88acb380) malloc: stack logs being written into /tmp/stack-logs.32181.104a26000.ls.6dB2BT.index
ls(32181,0x7fff88acb380) malloc: recording malloc and VM allocation stacks to disk using standard recorder
Applications Dropbox Public eclipse fsaclctl
Desktop Library Sites eclipse-workspace perf_demo
Desktop Folder Movies TheFindByContentFolder elixir-basics-workshop pgdata
Documents Music Trash elixir-fire-brigade-workshop
Downloads Pictures _rvm elixirconf_workshop
ls(32181,0x7fff88acb380) malloc: stack logs deleted from /tmp/stack-logs.32181.104a26000.ls.6dB2BT.index
SHR-MacBook-Pro:~ sribe$


Barring that, you could always insert a deliberate overrun in your NIF's initialization and check that it immediately crashes with an access violation.

--
Scott Ribe
[hidden email]
https://www.linkedin.com/in/scottribe/




_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Igor Clark
OK, great, thanks Scott! Combining this with Fred's help on running the
release with custom args, I have this:

> DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib MallocGuardEdges=1
> MallocStackLogging=1 _build/prod/rel/<my-app>/erts-9.3/bin/erl -boot
> _build/prod/rel/<my-app>/releases/0.1.0/<my-app> -config
> _build/prod/rel/<my-app>/releases/0.1.0/sys.config

which runs and gives this:

> beam.smp(62326,0x7fffa52133c0) malloc: stack logs being written into
> /tmp/stack-logs.62326.1f5a2000.beam.smp.QWSLfS.index

which seems good, the file is present and gets updated.

So, do I have this right: the point of the Guard Malloc is to make the
crash happen at the time of allocation, rather than delayed until
something trying to access it triggers the segfault; so if I get a crash
while running like this, I should be able to just check in the Console
debug log, and the stack trace should show where the bug actually is?

Or does this only work if I run the entire process through gdb or lldb?

cheers,
i

On 29/05/2018 15:42, Scott Ribe wrote:

>> On May 29, 2018, at 8:33 AM, Igor Clark <[hidden email]> wrote:
>>
>>  From what I can tell from its docs, I could either do that in Xcode, or using gdb, right? I have no clue how I'd go about running the BEAM inside Xcode, especially booting my application and NIF code, and I've got a lot to learn about how to connect any of this with rebar3 and releases. Do you run BEAM in Xcode or directly with gdb/lldb?
> Actually, all you have to do is export those environment variables in whatever environment you launch, not just Xcode or gdb.
>
> At launch, they barf some descriptive info to stdout (or stderr?), so if you can figure out where the beam sends that and monitor it, you should see confirmation that they're in effect:
>
>
> SHR-MacBook-Pro:~ sribe$ export MallocStackLogging=1
> SHR-MacBook-Pro:~ sribe$ ls
> ls(32181,0x7fff88acb380) malloc: stack logs being written into /tmp/stack-logs.32181.104a26000.ls.6dB2BT.index
> ls(32181,0x7fff88acb380) malloc: recording malloc and VM allocation stacks to disk using standard recorder
> Applications Dropbox Public eclipse fsaclctl
> Desktop Library Sites eclipse-workspace perf_demo
> Desktop Folder Movies TheFindByContentFolder elixir-basics-workshop pgdata
> Documents Music Trash elixir-fire-brigade-workshop
> Downloads Pictures _rvm elixirconf_workshop
> ls(32181,0x7fff88acb380) malloc: stack logs deleted from /tmp/stack-logs.32181.104a26000.ls.6dB2BT.index
> SHR-MacBook-Pro:~ sribe$
>
>
> Barring that, you could always insert a deliberate overrun in your NIF's initialization and check that it immediately crashes with an access violation.
>
> --
> Scott Ribe
> [hidden email]
> https://www.linkedin.com/in/scottribe/
>
>
>
>

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

scott ribe
> On May 29, 2018, at 9:16 AM, Igor Clark <[hidden email]> wrote:
>
> So, do I have this right: the point of the Guard Malloc is to make the crash happen at the time of allocation, rather than delayed until something trying to access it triggers the segfault; so if I get a crash while running like this, I should be able to just check in the Console debug log, and the stack trace should show where the bug actually is?

At the time of the illegal access, not the allocation. Yes, that's the point, you get a stack trace showing you illegal access.

However, the BEAM allocator will reduce its effectiveness. When you malloc in your C code, you get a block set up such that accessing just past it (or potentially before it) will cause an immediate crash. When you free it, it's then set up such that accessing will cause an immediate crash. But if you use Erlang's allocation routines, Erlang may malloc a bigger block with those protections, then hand out multiple suballocations, and access beyond the end of one of those can simply corrupt the next one without crashing at that point.

You should also be using MallocScribble & MallocPreScribble.



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Igor Clark
OK. Thanks very much Scott. I've got all this working using both those
extra options, and it does seem to make the NIF crash a lot sooner than
previously, which is great. But I'm still only seeing "process_main" in
the crashed thread, so I'm not much closer to knowing where the illegal
access is. I wonder if it's in lots of places because of what I'm doing
with the callback and the thread. I hope not.

I'll do some more digging, and tomorrow I'll try out a debug emulator
build as well.

Thanks very much for helping me get this far!

On 29/05/2018 16:31, Scott Ribe wrote:

>> On May 29, 2018, at 9:16 AM, Igor Clark <[hidden email]> wrote:
>>
>> So, do I have this right: the point of the Guard Malloc is to make the crash happen at the time of allocation, rather than delayed until something trying to access it triggers the segfault; so if I get a crash while running like this, I should be able to just check in the Console debug log, and the stack trace should show where the bug actually is?
> At the time of the illegal access, not the allocation. Yes, that's the point, you get a stack trace showing you illegal access.
>
> However, the BEAM allocator will reduce its effectiveness. When you malloc in your C code, you get a block set up such that accessing just past it (or potentially before it) will cause an immediate crash. When you free it, it's then set up such that accessing will cause an immediate crash. But if you use Erlang's allocation routines, Erlang may malloc a bigger block with those protections, then hand out multiple suballocations, and access beyond the end of one of those can simply corrupt the next one without crashing at that point.
>
> You should also be using MallocScribble & MallocPreScribble.
>
>
>

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Dominic Morneau
Can you give it a try with "+Mea min" in erl options? This should make Erlang fall back to malloc for all allocators, hopefully making guard malloc more effective.

Dominic

2018年5月30日(水) 5:15 Igor Clark <[hidden email]>:
OK. Thanks very much Scott. I've got all this working using both those
extra options, and it does seem to make the NIF crash a lot sooner than
previously, which is great. But I'm still only seeing "process_main" in
the crashed thread, so I'm not much closer to knowing where the illegal
access is. I wonder if it's in lots of places because of what I'm doing
with the callback and the thread. I hope not.

I'll do some more digging, and tomorrow I'll try out a debug emulator
build as well.

Thanks very much for helping me get this far!

On 29/05/2018 16:31, Scott Ribe wrote:
>> On May 29, 2018, at 9:16 AM, Igor Clark <[hidden email]> wrote:
>>
>> So, do I have this right: the point of the Guard Malloc is to make the crash happen at the time of allocation, rather than delayed until something trying to access it triggers the segfault; so if I get a crash while running like this, I should be able to just check in the Console debug log, and the stack trace should show where the bug actually is?
> At the time of the illegal access, not the allocation. Yes, that's the point, you get a stack trace showing you illegal access.
>
> However, the BEAM allocator will reduce its effectiveness. When you malloc in your C code, you get a block set up such that accessing just past it (or potentially before it) will cause an immediate crash. When you free it, it's then set up such that accessing will cause an immediate crash. But if you use Erlang's allocation routines, Erlang may malloc a bigger block with those protections, then hand out multiple suballocations, and access beyond the end of one of those can simply corrupt the next one without crashing at that point.
>
> You should also be using MallocScribble & MallocPreScribble.
>
>
>

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Igor Clark
Thanks Dominic - I don't want to count my chickens before they've hatched, but it looks like guard malloc has pointed me to at least some bugs even without that VM option. Even though I wasn't getting a line number in the stack trace, it was already seeming to make the NIF crash immediately and consistently, so I was able to use a ton of debug print statements to track down two problems that I hadn't been able to see before. (One was an enif_alloc() in the wrong place, and another seems to have been accessing a pointer from a function in a shared object file, oops.) No way would I have seen them without guard malloc showing me the way, it's a powerful tool :-)

So I fixed those two, and right now the app is running as expected without crashes under guard malloc. I'm pretty sure that I'll come up against more illegal-access bugs over time, so I'm adding "+Mea min" to the list of options to use when I find the next one. Thank you.

Thanks very much also to everyone who replied, particularly Scott for the guard malloc suggestion & help, and Fred & Tristan for the rebar3 tips so I could add the necessary CLI options and track down what was going on. I'm very glad to have been able to ask such experienced folks for advice, and to have learned about some *extremely* useful new stuff.

Cheers,
Igor


On 29/05/2018 23:58, Dominic Morneau wrote:
Can you give it a try with "+Mea min" in erl options? This should make Erlang fall back to malloc for all allocators, hopefully making guard malloc more effective.

Dominic

2018年5月30日(水) 5:15 Igor Clark <[hidden email]>:
OK. Thanks very much Scott. I've got all this working using both those
extra options, and it does seem to make the NIF crash a lot sooner than
previously, which is great. But I'm still only seeing "process_main" in
the crashed thread, so I'm not much closer to knowing where the illegal
access is. I wonder if it's in lots of places because of what I'm doing
with the callback and the thread. I hope not.

I'll do some more digging, and tomorrow I'll try out a debug emulator
build as well.

Thanks very much for helping me get this far!

On 29/05/2018 16:31, Scott Ribe wrote:
>> On May 29, 2018, at 9:16 AM, Igor Clark <[hidden email]> wrote:
>>
>> So, do I have this right: the point of the Guard Malloc is to make the crash happen at the time of allocation, rather than delayed until something trying to access it triggers the segfault; so if I get a crash while running like this, I should be able to just check in the Console debug log, and the stack trace should show where the bug actually is?
> At the time of the illegal access, not the allocation. Yes, that's the point, you get a stack trace showing you illegal access.
>
> However, the BEAM allocator will reduce its effectiveness. When you malloc in your C code, you get a block set up such that accessing just past it (or potentially before it) will cause an immediate crash. When you free it, it's then set up such that accessing will cause an immediate crash. But if you use Erlang's allocation routines, Erlang may malloc a bigger block with those protections, then hand out multiple suballocations, and access beyond the end of one of those can simply corrupt the next one without crashing at that point.
>
> You should also be using MallocScribble & MallocPreScribble.
>
>
>

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: How to track down intermittent segfaults in a threaded NIF

Igor Clark
Hey again - just closing the loop here. I had another even less frequent bug that was still triggering erts_exit(ERTS_ABORT_EXIT, ...), only happening once in a blue moon, so I fired up guard malloc with the "+Mea min" option, and this time not only did it give an immediate crash, but it also gave me the exact line number :-))

I can't tell you how useful this is - I've had these intermittent problems only showing up every now and then for a good while, with no real way to track them down, and so they've been grumbling at the back of my mind all that time. So the combination of the erl option and libgmalloc is just an amazing tool for me to hunt down this kind of issue. Thanks so much for telling me about it!

Very best,
Igor

On 30/05/2018 10:19, Igor Clark wrote:
Thanks Dominic - I don't want to count my chickens before they've hatched, but it looks like guard malloc has pointed me to at least some bugs even without that VM option. Even though I wasn't getting a line number in the stack trace, it was already seeming to make the NIF crash immediately and consistently, so I was able to use a ton of debug print statements to track down two problems that I hadn't been able to see before. (One was an enif_alloc() in the wrong place, and another seems to have been accessing a pointer from a function in a shared object file, oops.) No way would I have seen them without guard malloc showing me the way, it's a powerful tool :-)

So I fixed those two, and right now the app is running as expected without crashes under guard malloc. I'm pretty sure that I'll come up against more illegal-access bugs over time, so I'm adding "+Mea min" to the list of options to use when I find the next one. Thank you.

Thanks very much also to everyone who replied, particularly Scott for the guard malloc suggestion & help, and Fred & Tristan for the rebar3 tips so I could add the necessary CLI options and track down what was going on. I'm very glad to have been able to ask such experienced folks for advice, and to have learned about some *extremely* useful new stuff.

Cheers,
Igor


On 29/05/2018 23:58, Dominic Morneau wrote:
Can you give it a try with "+Mea min" in erl options? This should make Erlang fall back to malloc for all allocators, hopefully making guard malloc more effective.

Dominic

2018年5月30日(水) 5:15 Igor Clark <[hidden email]>:
OK. Thanks very much Scott. I've got all this working using both those
extra options, and it does seem to make the NIF crash a lot sooner than
previously, which is great. But I'm still only seeing "process_main" in
the crashed thread, so I'm not much closer to knowing where the illegal
access is. I wonder if it's in lots of places because of what I'm doing
with the callback and the thread. I hope not.

I'll do some more digging, and tomorrow I'll try out a debug emulator
build as well.

Thanks very much for helping me get this far!

On 29/05/2018 16:31, Scott Ribe wrote:
>> On May 29, 2018, at 9:16 AM, Igor Clark <[hidden email]> wrote:
>>
>> So, do I have this right: the point of the Guard Malloc is to make the crash happen at the time of allocation, rather than delayed until something trying to access it triggers the segfault; so if I get a crash while running like this, I should be able to just check in the Console debug log, and the stack trace should show where the bug actually is?
> At the time of the illegal access, not the allocation. Yes, that's the point, you get a stack trace showing you illegal access.
>
> However, the BEAM allocator will reduce its effectiveness. When you malloc in your C code, you get a block set up such that accessing just past it (or potentially before it) will cause an immediate crash. When you free it, it's then set up such that accessing will cause an immediate crash. But if you use Erlang's allocation routines, Erlang may malloc a bigger block with those protections, then hand out multiple suballocations, and access beyond the end of one of those can simply corrupt the next one without crashing at that point.
>
> You should also be using MallocScribble & MallocPreScribble.
>
>
>

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions