The "old" way of debugging something like this is to build a ring buffer into the binary that tracks the latest K events, and then add a way to grab that ring buffer (a UNIX signal, etc.). My bet is that you have some kind of deadlock that stems from an assumption about threads/mutexes in the Erlang VM which does not hold for how RTEMS implements the abstraction, i.e. a leaky abstraction. A way to inspect the reduction counter would also be good to have.
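To make the ring buffer idea concrete, here is a minimal sketch in C. Every name in it (`trace_record`, `trace_snapshot`, `TRACE_K`, the event fields) is made up for illustration and is not something that exists in the VM. It is single-threaded as written, so treat it as a starting point rather than something to drop into an SMP build unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of this exists in the Erlang VM. */
#define TRACE_K 8u /* number of events kept; must be a power of two */

typedef struct {
    uint32_t seq; /* running sequence number of the event */
    uint16_t tag; /* what happened, e.g. "mutex locked" */
    uint32_t arg; /* one word of context, e.g. a lock or process id */
} trace_event;

static trace_event trace_buf[TRACE_K];
static uint32_t trace_next; /* total number of events recorded so far */

/* Record one event, overwriting the oldest once the buffer is full.
 * Not thread-safe: with several scheduler threads, use one buffer per
 * thread or make trace_next atomic. Counter wraparound is ignored. */
static void trace_record(uint16_t tag, uint32_t arg)
{
    trace_event *e = &trace_buf[trace_next & (TRACE_K - 1u)];
    e->seq = trace_next;
    e->tag = tag;
    e->arg = arg;
    trace_next++;
}

/* Copy the surviving events, oldest first, into out (capacity TRACE_K).
 * Returns the number of events copied. */
static size_t trace_snapshot(trace_event *out)
{
    assert(out != NULL);
    uint32_t count = trace_next < TRACE_K ? trace_next : TRACE_K;
    uint32_t first = trace_next - count;
    for (uint32_t i = 0; i < count; i++) {
        out[i] = trace_buf[(first + i) & (TRACE_K - 1u)];
    }
    return count;
}
```

The "way to grab it" is then whatever your target gives you: a signal handler or a shell command that calls `trace_snapshot` and prints the result, or simply reading `trace_buf` and `trace_next` from the hardware debugger once the system is stuck.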
In general, write down what you assume about the VM state and start sprinkling assertions in. The goal is to be scientific, so verify your assumptions. Bugs often lurk where your intuition leads you astray: you take a giant leap of faith at a point where minute details matter, and they turn out to be different from what you expect.
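As a rough illustration of what I mean by sprinkling assertions, here is a sketch in C. The `ASSUME` macro, the `ASSUME_ABORTS` flag and the toy lock are all hypothetical and only stand in for whatever invariant you believe holds around the real thread/mutex layer. By default it logs and counts instead of aborting, which tends to be more useful on an embedded target.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of the Erlang VM. */
static unsigned assume_failures;

static void assume_report(const char *expr, const char *file, int line)
{
    assume_failures++;
    fprintf(stderr, "assumption failed: %s (%s:%d)\n", expr, file, line);
}

#ifdef ASSUME_ABORTS
/* Stop at the first violated assumption. */
#define ASSUME(cond) assert(cond)
#else
/* Log and count, but keep running so later violations show up too. */
#define ASSUME(cond) \
    do { if (!(cond)) assume_report(#cond, __FILE__, __LINE__); } while (0)
#endif

/* Toy lock standing in for a real mutex wrapper. */
typedef struct {
    int locked;
    int owner; /* id of the thread holding the lock, -1 if none */
} toy_lock;

static void toy_lock_acquire(toy_lock *l, int thread_id)
{
    ASSUME(!l->locked); /* written-down assumption: never taken twice */
    l->locked = 1;
    l->owner = thread_id;
}

static void toy_lock_release(toy_lock *l, int thread_id)
{
    ASSUME(l->locked);             /* assumption: only held locks are released */
    ASSUME(l->owner == thread_id); /* assumption: released by the owner */
    l->locked = 0;
    l->owner = -1;
}
```

Each `ASSUME` line is one assumption written down in a form the machine can check, which is the point of the exercise.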
The VM can be built in several debug modes, but I'm not sure they verify the underlying fabric is as expected.
On Tue, Feb 13, 2018 at 1:04 PM Sébastien Merle <[hidden email]> wrote:
We are working on GRiSP (grisp.org) and we are porting the Erlang VM to PowerPC/RTEMS. Everything works fine with 19.3.6 without threading (`--disable-threads`). But with a PLAIN or SMP build of either Erlang 19.3.6 or 20.2 we found a strange scheduling issue. Any hints and ideas on how to debug it would be greatly appreciated!
We have a very simple project to test the issue. It has a single supervisor starting a `proc_lib` worker that stays in a busy loop after calling `proc_lib:init_ack`, and a second worker that is a normal `gen_server` doing nothing. The symptom is that the supervisor starts the first worker and never gets to start the second one, so we never get to the Erlang console. When tracing the supervisor module (with `dbg`) we can see it "blocks" on `supervisor:do_start_child`, and when enabling verbose logging with the debug build we can see that no process gets started. This happens all the time; it is 100% reproducible.
What makes us think it is a scheduling issue is that adding `receive after 1 -> ok end` in the busy loop seems to fix the issue and properly start the second process and get us to the Erlang console.
Our port of the same code to ARM/RTEMS is working fine, we only have this issue on PowerPC.
We cannot use VM probes because we don't have DTrace on RTEMS, printing debug output in `erl_process.c` is probably not a good idea, and there is no clear place to start debugging from with a hardware debugger.
Any guidelines, hints or ideas on how to debug this?