Re: erlang (rabbitmq) generating core on Solaris SPARC


Re: erlang (rabbitmq) generating core on Solaris SPARC

Pooja Desai
Hi,

I am facing an Erlang core dump issue on a Solaris SPARC setup while running RabbitMQ.

(dbx) where

=>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14

  [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c

  [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958

  [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244

  [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),

at 0x1000622c0

  [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044

  [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040

  [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc

  [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08

  [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8

 

This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core only twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment everything works fine.

In both instances where I faced this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the RabbitMQ startup script. The script performs two operations:

it first pings RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then it starts rabbitmq-server in detached mode.

The core is generated while starting this daemon.


I checked the code around abandon_carrier (https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c), but nothing has changed in that area recently, so I am at a loss.

Please let me know if anyone has faced a similar issue in the past or has any idea about this. I am using OTP version 22.2 and RabbitMQ version 3.7.23.

Let me know if any further information is required; I am pasting the full core dump information below:

debugging core file of beam.smp (64-bit) from hostname01
file: temp_dir/erlang/erts-10.6/bin/beam.smp
initial argv:
/temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=
ffffffff004631b0

C++ symbol demangling enabled

# stack

cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)

#############################################################################

# registers

%g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
%g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
%g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
%g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
%g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
%g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
%g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
%g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
%o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
%o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
%o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
%o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
%o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
%o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
%o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
%o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0

 %ccr = 0x44 xcc=nZvc icc=nZvc
   %y = 0x0000000000000000
  %pc = 0x000000010006db14 cpool_insert+0xd0
 %npc = 0x000000010006db18 cpool_insert+0xd4
  %sp = 0xffffffff7c902eb1
  %fp = 0xffffffff7c902f61

 %asi = 0x82
%fprs = 0x00

# disassembly around pc

cpool_insert+0xa8:              mov       %g1, %g2
cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
cpool_insert+0xb8:              and       %g1, -0x4, %g4
cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
cpool_insert+0xc0:              and       %g2, 0x3, %g3
cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
cpool_insert+0xc8:              mov       %g2, %g1
cpool_insert+0xcc:              and       %g1, -0x4, %g4
cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
cpool_insert+0xdc:              cmp       %g5, %g4
cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
cpool_insert+0xe4:              or        %g2, %g1, %g2
cpool_insert+0xe8:              membar    #LoadLoad|#LoadStore
cpool_insert+0xec:              mov       0x100, %o5
cpool_insert+0xf0:              add       %g4, 0x10, %o4
cpool_insert+0xf4:              mov       %g2, %g3
cpool_insert+0xf8:              and       %g3, 0x1, %g1

# all threads

stack pointer for thread 1: ffffffff7fffa961
[ ffffffff7fffa961 libc.so.1`__pollsys+8() ]
  ffffffff7fffaa11 libc.so.1`pselect+0x1fc()
  ffffffff7fffaad1 libc.so.1`select+0xa4()
  ffffffff7fffab91 erts_sys_main_thread+0x24()
  ffffffff7fffac41 erl_start+0x232c()
  ffffffff7fffb0f1 main+0xc()
  ffffffff7fffb1a1 _start+0x7c()
stack pointer for thread 2: ffffffff396fb501
[ ffffffff396fb501 libc.so.1`__read+0xc() ]
  ffffffff396fb5b1 signal_dispatcher_thread_func+0x58()
  ffffffff396fb691 thr_wrapper+0x80()
  ffffffff396fb751 libc.so.1`_lwp_start()
stack pointer for thread 3: ffffffff7d6fb1f1
[ ffffffff7d6fb1f1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7d6fb2b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7d6fb361 libc.so.1`cond_wait+0x10()
  ffffffff7d6fb411 libc.so.1`pthread_cond_wait+8()
  ffffffff7d6fb4c1 ethr_cond_wait+8()
  ffffffff7d6fb571 sys_msg_dispatcher_func+0x1c0()
  ffffffff7d6fb691 thr_wrapper+0x80()
  ffffffff7d6fb751 libc.so.1`_lwp_start()
stack pointer for thread 4: ffffffff7d21f201
[ ffffffff7d21f201 libc.so.1`__lwp_park+0x14() ]
  ffffffff7d21f2c1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7d21f371 libc.so.1`cond_wait+0x10()
  ffffffff7d21f421 libc.so.1`pthread_cond_wait+8()
  ffffffff7d21f4d1 wait__.constprop.1+0x3c8()
  ffffffff7d223591 async_main+0x2f0()
  ffffffff7d223691 thr_wrapper+0x80()
  ffffffff7d223751 libc.so.1`_lwp_start()
stack pointer for thread 5: ffffffff7c902eb1
[ ffffffff7c902eb1 cpool_insert+0xd0() ]
  ffffffff7c902f61 dealloc_block.part.17+0x1c0()
  ffffffff7c903021 erts_alcu_check_delayed_dealloc+0xe4()
  ffffffff7c9030e1 erts_alloc_scheduler_handle_delayed_dealloc+0x34()
  ffffffff7c903191 handle_aux_work+0xa50()
  ffffffff7c903251 erts_schedule+0x192c()
  ffffffff7c9033c1 process_main+0xc4()
  ffffffff7c9035b1 sched_thread_func+0x168()
  ffffffff7c903691 thr_wrapper+0x80()
  ffffffff7c903751 libc.so.1`_lwp_start()
stack pointer for thread 6: ffffffff7c703141
[ ffffffff7c703141 erts_find_export_entry+0x7c() ]
  ffffffff7c703301 prepare_loading_2+0x68()
  ffffffff7c7033c1 process_main+0xcf0()
  ffffffff7c7035b1 sched_thread_func+0x168()
  ffffffff7c703691 thr_wrapper+0x80()
  ffffffff7c703751 libc.so.1`_lwp_start()
stack pointer for thread 7: ffffffff7befed41
[ ffffffff7befed41 libc.so.1`__lwp_park+0x14() ]
  ffffffff7befee01 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7befeeb1 libc.so.1`cond_wait_common+0x28c()
  ffffffff7befef71 libc.so.1`__cond_timedwait+0x8c()
  ffffffff7beff031 libc.so.1`cond_timedwait+0x14()
  ffffffff7beff0e1 libc.so.1`pthread_cond_timedwait+0xc()
  ffffffff7beff191 wait__.constprop.1+0x308()
  ffffffff7bf03251 erts_schedule+0x1de0()
  ffffffff7bf033c1 process_main+0xc4()
  ffffffff7bf035b1 sched_thread_func+0x168()
  ffffffff7bf03691 thr_wrapper+0x80()
  ffffffff7bf03751 libc.so.1`_lwp_start()
stack pointer for thread 8: ffffffff7bd02eb1
[ ffffffff7bd02eb1 mbc_free+0x174() ]
  ffffffff7bd02f61 dealloc_block.part.17+0x1c0()
  ffffffff7bd03021 erts_alcu_check_delayed_dealloc+0xe4()
  ffffffff7bd030e1 erts_alloc_scheduler_handle_delayed_dealloc+0x34()
  ffffffff7bd03191 handle_aux_work+0xa50()
  ffffffff7bd03251 erts_schedule+0x192c()
  ffffffff7bd033c1 process_main+0xc4()
  ffffffff7bd035b1 sched_thread_func+0x168()
  ffffffff7bd03691 thr_wrapper+0x80()
  ffffffff7bd03751 libc.so.1`_lwp_start()
stack pointer for thread 9: ffffffff7bb4eff1
[ ffffffff7bb4eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7bb4f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7bb4f161 libc.so.1`cond_wait+0x10()
  ffffffff7bb4f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7bb4f2c1 wait__.constprop.1+0x3c8()
  ffffffff7bb53381 erts_schedule+0x1de0()
  ffffffff7bb534f1 erts_dirty_process_main+0x1dc()
  ffffffff7bb535b1 sched_dirty_cpu_thread_func+0xd0()
  ffffffff7bb53691 thr_wrapper+0x80()
  ffffffff7bb53751 libc.so.1`_lwp_start()
stack pointer for thread a: ffffffff7ba4eff1
[ ffffffff7ba4eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7ba4f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7ba4f161 libc.so.1`cond_wait+0x10()
  ffffffff7ba4f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7ba4f2c1 wait__.constprop.1+0x3c8()
  ffffffff7ba53381 erts_schedule+0x1de0()
  ffffffff7ba534f1 erts_dirty_process_main+0x1dc()
  ffffffff7ba535b1 sched_dirty_cpu_thread_func+0xd0()
  ffffffff7ba53691 thr_wrapper+0x80()
  ffffffff7ba53751 libc.so.1`_lwp_start()
stack pointer for thread b: ffffffff7b94eff1
[ ffffffff7b94eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b94f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b94f161 libc.so.1`cond_wait+0x10()
  ffffffff7b94f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b94f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b953381 erts_schedule+0x1de0()
  ffffffff7b9534f1 erts_dirty_process_main+0x1dc()
  ffffffff7b9535b1 sched_dirty_cpu_thread_func+0xd0()
  ffffffff7b953691 thr_wrapper+0x80()
  ffffffff7b953751 libc.so.1`_lwp_start()
stack pointer for thread c: ffffffff7b84eff1
[ ffffffff7b84eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b84f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b84f161 libc.so.1`cond_wait+0x10()
  ffffffff7b84f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b84f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b853381 erts_schedule+0x1de0()
  ffffffff7b8534f1 erts_dirty_process_main+0x1dc()
  ffffffff7b8535b1 sched_dirty_cpu_thread_func+0xd0()
  ffffffff7b853691 thr_wrapper+0x80()
  ffffffff7b853751 libc.so.1`_lwp_start()
stack pointer for thread d: ffffffff7b74eff1
[ ffffffff7b74eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b74f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b74f161 libc.so.1`cond_wait+0x10()
  ffffffff7b74f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b74f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b753381 erts_schedule+0x1de0()
  ffffffff7b7534f1 erts_dirty_process_main+0x78()
  ffffffff7b7535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b753691 thr_wrapper+0x80()
  ffffffff7b753751 libc.so.1`_lwp_start()
stack pointer for thread e: ffffffff7b64eff1
[ ffffffff7b64eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b64f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b64f161 libc.so.1`cond_wait+0x10()
  ffffffff7b64f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b64f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b653381 erts_schedule+0x1de0()
  ffffffff7b6534f1 erts_dirty_process_main+0x78()
  ffffffff7b6535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b653691 thr_wrapper+0x80()
  ffffffff7b653751 libc.so.1`_lwp_start()
stack pointer for thread 10: ffffffff7b44eff1
[ ffffffff7b44eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b44f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b44f161 libc.so.1`cond_wait+0x10()
  ffffffff7b44f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b44f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b453381 erts_schedule+0x1de0()
  ffffffff7b4534f1 erts_dirty_process_main+0x288()
  ffffffff7b4535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b453691 thr_wrapper+0x80()
  ffffffff7b453751 libc.so.1`_lwp_start()
stack pointer for thread 11: ffffffff7b34eff1
[ ffffffff7b34eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b34f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b34f161 libc.so.1`cond_wait+0x10()
  ffffffff7b34f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b34f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b353381 erts_schedule+0x1de0()
  ffffffff7b3534f1 erts_dirty_process_main+0x288()
  ffffffff7b3535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b353691 thr_wrapper+0x80()
  ffffffff7b353751 libc.so.1`_lwp_start()
stack pointer for thread 12: ffffffff7b24eff1
[ ffffffff7b24eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b24f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b24f161 libc.so.1`cond_wait+0x10()
  ffffffff7b24f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b24f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b253381 erts_schedule+0x1de0()
  ffffffff7b2534f1 erts_dirty_process_main+0x288()
  ffffffff7b2535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b253691 thr_wrapper+0x80()
  ffffffff7b253751 libc.so.1`_lwp_start()
stack pointer for thread 13: ffffffff7b1532d1
[ ffffffff7b1532d1 sched_spin_wait+0x17c() ]
  ffffffff7b153381 erts_schedule+0x19d0()
  ffffffff7b1534f1 erts_dirty_process_main+0x78()
  ffffffff7b1535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b153691 thr_wrapper+0x80()
  ffffffff7b153751 libc.so.1`_lwp_start()
stack pointer for thread 14: ffffffff7b04eff1
[ ffffffff7b04eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b04f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b04f161 libc.so.1`cond_wait+0x10()
  ffffffff7b04f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b04f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b053381 erts_schedule+0x1de0()
  ffffffff7b0534f1 erts_dirty_process_main+0x78()
  ffffffff7b0535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b053691 thr_wrapper+0x80()
  ffffffff7b053751 libc.so.1`_lwp_start()
stack pointer for thread 15: ffffffff7af4eff1
[ ffffffff7af4eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7af4f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7af4f161 libc.so.1`cond_wait+0x10()
  ffffffff7af4f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7af4f2c1 wait__.constprop.1+0x3c8()
  ffffffff7af53381 erts_schedule+0x1de0()
  ffffffff7af534f1 erts_dirty_process_main+0x78()
  ffffffff7af535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7af53691 thr_wrapper+0x80()
  ffffffff7af53751 libc.so.1`_lwp_start()
stack pointer for thread 16: ffffffff7ae4eff1
[ ffffffff7ae4eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7ae4f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7ae4f161 libc.so.1`cond_wait+0x10()
  ffffffff7ae4f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7ae4f2c1 wait__.constprop.1+0x3c8()
  ffffffff7ae53381 erts_schedule+0x1de0()
  ffffffff7ae534f1 erts_dirty_process_main+0x78()
  ffffffff7ae535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7ae53691 thr_wrapper+0x80()
  ffffffff7ae53751 libc.so.1`_lwp_start()
stack pointer for thread 17: ffffffff7ad4f221
[ ffffffff7ad4f221 libc.so.1`__lwp_park+0x14() ]
  ffffffff7ad4f2e1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7ad4f391 libc.so.1`cond_wait+0x10()
  ffffffff7ad4f441 libc.so.1`pthread_cond_wait+8()
  ffffffff7ad4f4f1 wait__.constprop.1+0x3c8()
  ffffffff7ad535b1 aux_thread+0x2ec()
  ffffffff7ad53691 thr_wrapper+0x80()
  ffffffff7ad53751 libc.so.1`_lwp_start()
stack pointer for thread 18: ffffffff7ac533d1
[ ffffffff7ac533d1 libc.so.1`ioctl+0xc() ]
  ffffffff7ac534c1 erts_check_io+0x54()
  ffffffff7ac535b1 poll_thread+0x208()
  ffffffff7ac53691 thr_wrapper+0x80()
  ffffffff7ac53751 libc.so.1`_lwp_start()

# object mappings

            BASE            LIMIT             SIZE NAME
       100000000        100340000           340000 temp_dir/erlang/erts-10.6/bin/beam.smp
ffffffff73c00000 ffffffff73c02000             2000 /lib/sparcv9/libsendfile.so.1
ffffffff3c700000 ffffffff3c706000             6000 /lib/sparcv9/libdlpi.so.1
ffffffff786fe000 ffffffff78700000             2000 /lib/sparcv9/libdl.so.1
ffffffff7f200000 ffffffff7f2b0000            b0000 /lib/sparcv9/libm.so.2
ffffffff7f000000 ffffffff7f010000            10000 /lib/sparcv9/libsocket.so.1
ffffffff7ee00000 ffffffff7ee70000            70000 /lib/sparcv9/libnsl.so.1
ffffffff75f00000 ffffffff75f02000             2000 /lib/sparcv9/libkstat.so.1
ffffffff7eafc000 ffffffff7eb00000             4000 /lib/sparcv9/libpthread.so.1
ffffffff786f8000 ffffffff786fa000             2000 /lib/sparcv9/librt.so.1
ffffffff7eb00000 ffffffff7ec80000           180000 /lib/sparcv9/libc.so.1
ffffffff7e000000 ffffffff7e400000           400000 /usr/lib/locale/en_US.UTF-8/sparcv9/en_US.UTF-8.so.3
ffffffff7de00000 ffffffff7de10000            10000 /usr/lib/locale/en_US.UTF-8/sparcv9/methods_unicode.so.3
ffffffff66900000 ffffffff66902000             2000 /usr/lib/sparcv9/libsctp.so.1
ffffffff7cc00000 ffffffff7cd30000           130000 /lib/sparcv9/libucrypto.so.1
ffffffff7c200000 ffffffff7c210000            10000 /lib/sparcv9/libcryptoutil.so.1
ffffffff7c000000 ffffffff7c030000            30000 /lib/sparcv9/libelf.so.1
ffffffff7ca00000 ffffffff7ca10000            10000 /lib/sparcv9/libz.so.1
ffffffff75d00000 ffffffff75d04000             4000 /lib/sparcv9/libmp.so.2
ffffffff7f500000 ffffffff7f540000            40000 /lib/sparcv9/ld.so.1

# machine information
Hostname: hostname01
Release: 5.11
Kernel architecture: sun4v
Application architecture: sparcv9
Kernel version: SunOS 5.11 sun4v 11.3
Platform: sun4v



argv[0]: /temp_dir/erlang/erts-10.6/bin/beam.smp
argv[1]: --
argv[2]: -root
argv[3]: /temp_dir/erlang
argv[4]: -progname
argv[5]: erl
argv[6]: --
argv[7]: -home
argv[8]: shared/global/mqbroker/mqhome
argv[9]: -epmd_port
argv[10]: 13778
argv[11]: --
argv[12]: -boot
argv[13]: no_dot_erlang
argv[14]: -sname
argv[15]: epmd-starter-25205088
argv[16]: -noshell
argv[17]: -noinput
argv[18]: -s
argv[19]: erlang
argv[20]: halt
argv[21]: --


# uname -a
SunOS hostname01 5.11 11.3 sun4v sparc sun4v


Thanks,

Pooja


Re: erlang (rabbitmq) generating core on Solaris SPARC

Mikael Pettersson-5
Hello Pooja,

On Mon, May 11, 2020 at 8:10 AM Pooja Desai <[hidden email]> wrote:
>
> Hi,
>
> Facing erlang core issue on solaris SPARC setup while running RabbitMQ

This looks like a 64-bit build, but the code doesn't look similar to
what I get with gcc-9.3, so I'm assuming you used Sun's compiler?


> (dbx) where
>
> =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>
>   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>
>   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
>
>   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
>
>   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
>
> at 0x1000622c0
>
>   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
>
>   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>
>   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
>
>   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>
>   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>
>
>
> This issue is extremely intermittent so I am not able to reproduce it with debug build. But on our test setup I have seen this core twice only for solaris Sparc server for other servers (RHEL, Suse linux, Solarisx86, Windows etc.) with similar test environment things are working fine.
>
> In two instances when I faced this issue we are restarting Rabbitmq server. i.e. stop RabbitMQ and epmd then run startup script for rabbitmq. This performs 2 operations,
>
> First ping rabbitmq using "rabbitmqctl ping" to confirm rabbitmq is not already running ( I guess in background this will also start epmd) and then start rabbitmq-server in detached mode.
>
> Core is generated while starting this demon.
>
>
> I checked code around abandon_carrier("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c") but nothing changed in that area recently. So I am really clueless situation.
>
> Please le me know if anyone faced similar issue in past or have any idea around this. Using OTP version 22.2 and RabbitMQ version 3.7.23.
>
> Let me know any further information is required, pasting full core dump information below:
>
> debugging core file of beam.smp (64-bit) from hostname01
> file: temp_dir/erlang/erts-10.6/bin/beam.smp
> initial argv:
> /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
> threading model: native threads
> status: process terminated by SIGSEGV (Segmentation Fault), addr=
> ffffffff004631b0

Ok, this tells us the address was unmapped.  (It's not an alignment
fault, another common issue on SPARC.)


>
> C++ symbol demangling enabled
>
> # stack
>
> cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
> dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
> erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
> erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
> handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
> erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
> process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
> sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
> thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
> libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>
> #############################################################################
>
> # registers
>
> %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
> %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
> %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
> %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
> %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
> %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000

This is interesting.  Notice how the low 32-bits 004631a0 show up in
three variations:
1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
firstfit_carrier_pool global variable)
2. ffffffff004631a0 (the above, but with the high 32 bits replaced
with all-bits-one)
3. ffffffff004631a1 (the above, but with a tag in the low bit)
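(An illustrative aside on the tagging scheme: the low bits of these link words carry flag/tag bits, which the disassembly below strips off with "and %g1, -0x4, %g4" before dereferencing. A minimal sketch — the names are invented for illustration, not the actual ERTS macros:)

```c
#include <stdint.h>

/* Tag handling as seen in the disassembly: the low 2 bits of a link
 * word are flag/tag bits, and "and reg, -0x4, reg" strips them off to
 * recover the pointer before it is dereferenced.  Names invented for
 * illustration; these are not the real ERTS macros. */
static inline uint64_t link_ptr(uint64_t word)
{
    return word & ~(uint64_t)0x3;   /* same effect as "and %g1, -0x4, %g4" */
}

static inline unsigned link_tags(uint64_t word)
{
    return (unsigned)(word & 0x3);  /* mod/tag marker bits */
}
```

With these, 0xffffffff004631a1 untags to 0xffffffff004631a0 with tag 1 — exactly the pair of values sitting in %g1 and %g4 above.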

> %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
> %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
> %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
> %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
> %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
> %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
> %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
> %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
> %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
> %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
>
>  %ccr = 0x44 xcc=nZvc icc=nZvc
>    %y = 0x0000000000000000
>   %pc = 0x000000010006db14 cpool_insert+0xd0
>  %npc = 0x000000010006db18 cpool_insert+0xd4
>   %sp = 0xffffffff7c902eb1
>   %fp = 0xffffffff7c902f61
>
>  %asi = 0x82
> %fprs = 0x00
>
> # dissassembly around pc
>
> cpool_insert+0xa8:              mov       %g1, %g2
> cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
> cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
> cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
> cpool_insert+0xb8:              and       %g1, -0x4, %g4

> cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
> cpool_insert+0xc0:              and       %g2, 0x3, %g3
> cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
> cpool_insert+0xc8:              mov       %g2, %g1
> cpool_insert+0xcc:              and       %g1, -0x4, %g4
> cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1

This is the faulting instruction. We're in the /* Find a predecessor
to be, and set mod marker on its next ptr */ loop.

> cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
> cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
> cpool_insert+0xdc:              cmp       %g5, %g4
> cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
> cpool_insert+0xe4:              or        %g2, %g1, %g2

The above reads a 64-bit "->next" pointer by assembling two adjacent
32-bit fields.  Weird, but arithmetically Ok.

Two things strike me:
1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
load another 32 bits, combine", which isn't correct in a multithreaded
program.  The error could be in the compiler, or in the source code.
2. In the register dump it was obvious that the high bits of an
address had been clobbered.

My suspicion is that either Sun's compiler is buggy, or Erlang is
selecting non thread-safe code in this case.

On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
those 64-bit loads, as expected.
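(An illustrative sketch for comparison, using C11 <stdatomic.h> rather than the ethread atomics layer ERTS actually uses — the type and function names are invented: an atomic 64-bit load must compile to a single load instruction, which is what GCC emits for code like this, so no half-updated value can ever be observed:)

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative only: ERTS has its own "ethread" atomics layer, but the
 * requirement is the same -- an atomic 64-bit load must be one ldx on
 * sparcv9, never two 32-bit ld instructions combined with sllx/or. */
typedef struct {
    _Atomic uint64_t next;          /* carrier-pool style link word */
} node_t;

static inline uint64_t load_next(node_t *n)
{
    /* Compiles to a single 64-bit load (ldx on SPARC, mov on x86-64);
     * a concurrent store can never be observed half-done. */
    return atomic_load_explicit(&n->next, memory_order_acquire);
}
```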

/Mikael

Re: erlang (rabbitmq) generating core on Solaris SPARC

Pooja Desai

Hi,

 

Thanks for the response, Mikael.

As per your suggestion I am trying to write similar code to determine whether there is an issue with the Solaris SPARC compiler.

 

But I have some doubts:

1. If there is a problem with the compiler then we should be able to see this crash everywhere else too; any idea why it is only reproduced here?

2. As I understand your explanation, the code reads 64 bits by assembling two adjacent 32-bit fields. Will that really cause a problem in a multi-threaded program? Considering that while context switching to another thread, the OS will save the current context of the thread (and hence its registers) and bring it back when the thread is active again.


 

Thanks & Regards,

Pooja 


On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <[hidden email]> wrote:
Hello Pooja,

On Mon, May 11, 2020 at 8:10 AM Pooja Desai <[hidden email]> wrote:
>
> Hi,
>
> Facing erlang core issue on solaris SPARC setup while running RabbitMQ

This looks like a 64-bit build, but the code doesn't look similar to
what I get with gcc-9.3, so I'm assuming you used Sun's compiler?


> (dbx) where
>
> =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>
>   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>
>   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
>
>   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
>
>   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
>
> at 0x1000622c0
>
>   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
>
>   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>
>   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
>
>   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>
>   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>
>
>
> This issue is extremely intermittent so I am not able to reproduce it with debug build. But on our test setup I have seen this core twice only for solaris Sparc server for other servers (RHEL, Suse linux, Solarisx86, Windows etc.) with similar test environment things are working fine.
>
> In two instances when I faced this issue we are restarting Rabbitmq server. i.e. stop RabbitMQ and epmd then run startup script for rabbitmq. This performs 2 operations,
>
> First ping rabbitmq using "rabbitmqctl ping" to confirm rabbitmq is not already running ( I guess in background this will also start epmd) and then start rabbitmq-server in detached mode.
>
> Core is generated while starting this demon.
>
>
> I checked code around abandon_carrier("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c") but nothing changed in that area recently. So I am really clueless situation.
>
> Please le me know if anyone faced similar issue in past or have any idea around this. Using OTP version 22.2 and RabbitMQ version 3.7.23.
>
> Let me know any further information is required, pasting full core dump information below:
>
> debugging core file of beam.smp (64-bit) from hostname01
> file: temp_dir/erlang/erts-10.6/bin/beam.smp
> initial argv:
> /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
> threading model: native threads
> status: process terminated by SIGSEGV (Segmentation Fault), addr=
> ffffffff004631b0

Ok, this tells us the address was unmapped.  (It's not an alignment
fault, another common issue on SPARC.)


>
> C++ symbol demangling enabled
>
> # stack
>
> cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
> dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
> erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
> erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
> handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
> erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
> process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
> sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
> thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
> libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>
> #############################################################################
>
> # registers
>
> %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
> %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
> %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
> %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
> %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
> %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000

This is interesting.  Notice how the low 32 bits, 004631a0, show up in
three variations:
1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
firstfit_carrier_pool global variable)
2. ffffffff004631a0 (the above, but with the high 32 bits replaced
with all-bits-one)
3. ffffffff004631a1 (the above, but with a tag in the low bit)

> %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
> %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
> %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
> %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
> %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
> %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
> %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
> %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
> %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
> %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
>
>  %ccr = 0x44 xcc=nZvc icc=nZvc
>    %y = 0x0000000000000000
>   %pc = 0x000000010006db14 cpool_insert+0xd0
>  %npc = 0x000000010006db18 cpool_insert+0xd4
>   %sp = 0xffffffff7c902eb1
>   %fp = 0xffffffff7c902f61
>
>  %asi = 0x82
> %fprs = 0x00
>
> # disassembly around pc
>
> cpool_insert+0xa8:              mov       %g1, %g2
> cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
> cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
> cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
> cpool_insert+0xb8:              and       %g1, -0x4, %g4

> cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
> cpool_insert+0xc0:              and       %g2, 0x3, %g3
> cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
> cpool_insert+0xc8:              mov       %g2, %g1
> cpool_insert+0xcc:              and       %g1, -0x4, %g4
> cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1

This is the faulting instruction. We're in the /* Find a predecessor
to be, and set mod marker on its next ptr */ loop.

> cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
> cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
> cpool_insert+0xdc:              cmp       %g5, %g4
> cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
> cpool_insert+0xe4:              or        %g2, %g1, %g2

The above reads a 64-bit "->next" pointer by assembling two adjacent
32-bit fields.  Weird, but arithmetically Ok.

Two things strike me:
1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
load another 32 bits, combine", which isn't correct in a multithreaded
program.  The error could be in the compiler, or in the source code.
2. In the register dump it was obvious that the high bits of an
address had been clobbered.

My suspicion is that either Sun's compiler is buggy, or Erlang is
selecting non thread-safe code in this case.

On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
those 64-bit loads, as expected.

/Mikael

Re: erlang (rabbitmq) generating core on Solaris SPARC

Mikael Pettersson-5
On Tue, May 12, 2020 at 4:18 PM Pooja Desai <[hidden email]> wrote:

>
> Hi,
>
>
>
> Thanks for the response, Mikael.
>
> As per your suggestion, I am trying to write similar code to determine whether there is some issue with the Solaris SPARC compiler.
>
>
>
> But I have some doubts,
>
> 1.     If there is a problem with the compiler, then we should see this crash everywhere else too; any idea why it is only reproduced here?
>
> 2.     As I understand your explanation, it reads 64 bits by assembling two adjacent 32-bit fields. Will that really cause a problem in a multi-threaded program? When context-switching to another thread, the OS saves the thread's current context (and hence its registers) and restores it when the thread runs again.
>
>

Breaking up a 64-bit load into two 32-bit loads loses atomicity with
respect to any concurrent store to that location: the read may end up
observing a result composed of 32 bits from the old value and 32 bits
from the newly stored value, whereas the code expects to see either
the old value or the new one, but never this mixture.  This can also
happen on a single-core CPU with preemptive multitasking.

To move forward on the issue, I think you need to recreate the
pre-processed source for erl_alloc_util.c.  To do that:
1. Compile Erlang/OTP as usual, starting from a pristine source
directory (no left-overs from a previous build, best is to start fresh
somewhere), but pass "V=1" to make.  Save the output from "make" in a
file.
2. Note the step where it compiles erl_alloc_util.c.
3. Reexecute that step, but replace any "-c" with "-E" and "-o
erl_alloc_util.o" with "-o erl_alloc_util.i".
4. Please send this ".i" file, together with the exact build steps and
configuration options you used, and
"erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
to me.
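For illustration only (the flags and paths below are hypothetical; use the exact command captured from your own V=1 output), steps 2-3 amount to a textual substitution like this:

```shell
# A compile command saved from "make V=1" (hypothetical flags and paths):
cmd='gcc -m64 -O3 -c beam/erl_alloc_util.c -o obj/opt/smp/erl_alloc_util.o'

# Replace -c with -E (stop after preprocessing) and the .o with a .i:
pre=$(printf '%s' "$cmd" | sed -e 's/ -c / -E /' -e 's/\.o$/.i/')
echo "$pre"
# gcc -m64 -O3 -E beam/erl_alloc_util.c -o obj/opt/smp/erl_alloc_util.i
```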

My theory is that Erlang/OTP selects the wrong low-level primitives
for this platform.



Re: erlang (rabbitmq) generating core on Solaris SPARC

Pooja Desai

Hi Mikael,


Please find the files you requested attached as erl_files.tar.gz (compressed, as I was facing an issue with the mail size).

The normal build command is:

# gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o

Following your suggestion, I updated it as below to generate the erl_alloc_util.i file:

# gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i

One thing I missed mentioning: we are using gcc version 4.9.2 for building on Solaris SPARC, as Erlang doesn't support Sun's native compiler.

Thanks & Regards,
Pooja


erl_files.tar.gz (179K) attached

Re: erlang (rabbitmq) generating core on Solaris SPARC

Mikael Pettersson-5
On Thu, May 14, 2020 at 9:32 AM Pooja Desai <[hidden email]> wrote:

> Also one thing I missed to mention, we are using gcc version 4.9.2 (GCC) for building on solaris SPARC as erlang doesn't support Sun's native compiler.

I've been able to reproduce the non-atomic code for those 64-bit loads
in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.

So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
your Erlang/OTP VM with that.

/Mikael

>
> Thanks & Regards,
> Pooja
>
> On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <[hidden email]> wrote:
>>
>> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <[hidden email]> wrote:
>> >
>> > Hi,
>> >
>> >
>> >
>> > Thanks for response Mikael
>> >
>> > As per your suggestion I am trying to write similar code to conclude if there is some issue with Solaris SPARC compiler.
>> >
>> >
>> >
>> > But I have some doubts,
>> >
>> > 1.     If there is problem with compiler then we should be able to see this crash everywhere else also, any idea why its only reproduced here?
>> >
>> > 2.     As I understand your explanation it reads 64 bits by assembling two adjacent 32 bits fields. Will it really cause problem in multi-threaded program? Considering while context switching to another thread, OS will save current context of the thread (and hence registers) and will bring back when thread is active again.
>> >
>> >
>>
>> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
>> any concurrent store into that location, meaning the read may end up
>> observing a result composed of 32 bit from the old value and 32 bit
>> from the newly stored value, whereas the code expects to see either
>> the old or the new, but never this mixture.  This can happen also on a
>> single-threaded CPU with preemptive multitasking.
>>
>> To move forward on the issue, I think you need to recreate the
>> pre-processed source for erl_alloc_util.c.  To do that:
>> 1. Compile Erlang/OTP as usual, starting from a pristine source
>> directory (no left-overs from a previous build, best is to start fresh
>> somewhere), but pass "V=1" to make.  Save the output from "make" in a
>> file.
>> 2. Note the step where it compiles erl_alloc_util.c.
>> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
>> erl_alloc_util.o" with "-o erl_alloc_util.i".
>> 4. Please send this ".i" file, together with the exact build steps and
>> configuration options you used, and
>> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
>> to me.
>>
>> My theory is that Erlang/OTP selects the wrong low-level primitives
>> for this platform.
>>
>>
>> >
>> >
>> > Thanks & Regards,
>> >
>> > Pooja
>> >
>> >
>> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <[hidden email]> wrote:
>> >>
>> >> Hello Pooja,
>> >>
>> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <[hidden email]> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > Facing erlang core issue on solaris SPARC setup while running RabbitMQ
>> >>
>> >> This looks like a 64-bit build, but the code doesn't look similar to
>> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
>> >>
>> >>
>> >> > (dbx) where
>> >> >
>> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>> >> >
>> >> >   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>> >> >
>> >> >   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
>> >> >
>> >> >   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
>> >> >
>> >> >   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
>> >> >
>> >> > at 0x1000622c0
>> >> >
>> >> >   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
>> >> >
>> >> >   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>> >> >
>> >> >   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
>> >> >
>> >> >   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>> >> >
>> >> >   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>> >> >
>> >> >
>> >> >
>> >> > This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment things work fine.
>> >> >
>> >> > In the two instances when I faced this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for RabbitMQ. This performs two operations:
>> >> >
>> >> > First, ping RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then start rabbitmq-server in detached mode.
>> >> >
>> >> > The core is generated while starting this daemon.
>> >> >
>> >> >
>> >> > I checked the code around abandon_carrier (https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c) but nothing changed in that area recently, so I am in a really clueless situation.
>> >> >
>> >> > Please let me know if anyone has faced a similar issue in the past or has any idea about this. We are using OTP version 22.2 and RabbitMQ version 3.7.23.
>> >> >
>> >> > Let me know if any further information is required; I am pasting the full core dump information below:
>> >> >
>> >> > debugging core file of beam.smp (64-bit) from hostname01
>> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
>> >> > initial argv:
>> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
>> >> > threading model: native threads
>> >> > status: process terminated by SIGSEGV (Segmentation Fault), addr=
>> >> > ffffffff004631b0
>> >>
>> >> Ok, this tells us the address was unmapped.  (It's not an alignment
>> >> fault, another common issue on SPARC.)
>> >>
>> >>
>> >> >
>> >> > C++ symbol demangling enabled
>> >> >
>> >> > # stack
>> >> >
>> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
>> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
>> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
>> >> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
>> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
>> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
>> >> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
>> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
>> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
>> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>> >> >
>> >> > #############################################################################
>> >> >
>> >> > # registers
>> >> >
>> >> > %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
>> >> > %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
>> >> > %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
>> >> > %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
>> >> > %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
>> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
>> >>
>> >> This is interesting.  Notice how the low 32-bits 004631a0 show up in
>> >> three variations:
>> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
>> >> firstfit_carrier_pool global variable)
>> >> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
>> >> with all-bits-one)
>> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
>> >>
>> >> > %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
>> >> > %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
>> >> > %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
>> >> > %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
>> >> > %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
>> >> > %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
>> >> > %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
>> >> > %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
>> >> > %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
>> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
>> >> >
>> >> >  %ccr = 0x44 xcc=nZvc icc=nZvc
>> >> >    %y = 0x0000000000000000
>> >> >   %pc = 0x000000010006db14 cpool_insert+0xd0
>> >> >  %npc = 0x000000010006db18 cpool_insert+0xd4
>> >> >   %sp = 0xffffffff7c902eb1
>> >> >   %fp = 0xffffffff7c902f61
>> >> >
>> >> >  %asi = 0x82
>> >> > %fprs = 0x00
>> >> >
>> >> > # disassembly around pc
>> >> >
>> >> > cpool_insert+0xa8:              mov       %g1, %g2
>> >> > cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
>> >> > cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
>> >> > cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
>> >> > cpool_insert+0xb8:              and       %g1, -0x4, %g4
>> >>
>> >> > cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
>> >> > cpool_insert+0xc0:              and       %g2, 0x3, %g3
>> >> > cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
>> >> > cpool_insert+0xc8:              mov       %g2, %g1
>> >> > cpool_insert+0xcc:              and       %g1, -0x4, %g4
>> >> > cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
>> >>
>> >> This is the faulting instruction. We're in the /* Find a predecessor
>> >> to be, and set mod marker on its next ptr */ loop.
>> >>
>> >> > cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
>> >> > cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
>> >> > cpool_insert+0xdc:              cmp       %g5, %g4
>> >> > cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
>> >> > cpool_insert+0xe4:              or        %g2, %g1, %g2
>> >>
>> >> The above reads a 64-bit "->next" pointer by assembling two adjacent
>> >> 32-bit fields.  Weird, but arithmetically Ok.
>> >>
>> >> Two things strike me:
>> >> 1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
>> >> load another 32 bits, combine", which isn't correct in a multithreaded
>> >> program.  The error could be in the compiler, or in the source code.
>> >> 2. In the register dump it was obvious that the high bits of an
>> >> address had been clobbered.
>> >>
>> >> My suspicion is that either Sun's compiler is buggy, or Erlang is
>> >> selecting non thread-safe code in this case.
>> >>
>> >> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
>> >> those 64-bit loads, as expected.
>> >>
>> >> /Mikael

Re: erlang (rabbitmq) generating core on Solaris SPARC

Richard O'Keefe
In reply to this post by Pooja Desai
GCC 4.9.2 is rather elderly.  According to gcc.gnu.org,
"Newer Solaris versions provide one or more of GCC 5, 7, and 9."
where "Newer" means 11.4 and later.

GCC 10.1 was released this month.  If I were you I'd upgrade to
GCC 9.3 if possible.


On Mon, 11 May 2020 at 18:10, Pooja Desai <[hidden email]> wrote:
Hi,

We are facing an Erlang core dump issue on a Solaris SPARC setup while running RabbitMQ

(dbx) where

=>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14

  [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c

  [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958

  [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244

  [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),

at 0x1000622c0

  [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044

  [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040

  [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc

  [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08

  [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8

 

This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment things work fine.

In the two instances when I faced this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for RabbitMQ. This performs two operations:

First, ping RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then start rabbitmq-server in detached mode.

The core is generated while starting this daemon.


I checked the code around abandon_carrier (https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c) but nothing changed in that area recently, so I am in a really clueless situation.

Please let me know if anyone has faced a similar issue in the past or has any idea about this. We are using OTP version 22.2 and RabbitMQ version 3.7.23.

Let me know if any further information is required; I am pasting the full core dump information below:

debugging core file of beam.smp (64-bit) from hostname01
file: temp_dir/erlang/erts-10.6/bin/beam.smp
initial argv:
/temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=
ffffffff004631b0

C++ symbol demangling enabled

# stack

cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)

#############################################################################

# registers

%g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
%g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
%g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
%g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
%g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
%g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
%g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
%g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
%o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
%o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
%o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
%o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
%o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
%o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
%o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
%o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0

 %ccr = 0x44 xcc=nZvc icc=nZvc
   %y = 0x0000000000000000
  %pc = 0x000000010006db14 cpool_insert+0xd0
 %npc = 0x000000010006db18 cpool_insert+0xd4
  %sp = 0xffffffff7c902eb1
  %fp = 0xffffffff7c902f61

 %asi = 0x82
%fprs = 0x00

# disassembly around pc

cpool_insert+0xa8:              mov       %g1, %g2
cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
cpool_insert+0xb8:              and       %g1, -0x4, %g4
cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
cpool_insert+0xc0:              and       %g2, 0x3, %g3
cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
cpool_insert+0xc8:              mov       %g2, %g1
cpool_insert+0xcc:              and       %g1, -0x4, %g4
cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
cpool_insert+0xdc:              cmp       %g5, %g4
cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
cpool_insert+0xe4:              or        %g2, %g1, %g2
cpool_insert+0xe8:              membar    #LoadLoad|#LoadStore
cpool_insert+0xec:              mov       0x100, %o5
cpool_insert+0xf0:              add       %g4, 0x10, %o4
cpool_insert+0xf4:              mov       %g2, %g3
cpool_insert+0xf8:              and       %g3, 0x1, %g1

# all threads

stack pointer for thread 1: ffffffff7fffa961
[ ffffffff7fffa961 libc.so.1`__pollsys+8() ]
  ffffffff7fffaa11 libc.so.1`pselect+0x1fc()
  ffffffff7fffaad1 libc.so.1`select+0xa4()
  ffffffff7fffab91 erts_sys_main_thread+0x24()
  ffffffff7fffac41 erl_start+0x232c()
  ffffffff7fffb0f1 main+0xc()
  ffffffff7fffb1a1 _start+0x7c()
stack pointer for thread 2: ffffffff396fb501
[ ffffffff396fb501 libc.so.1`__read+0xc() ]
  ffffffff396fb5b1 signal_dispatcher_thread_func+0x58()
  ffffffff396fb691 thr_wrapper+0x80()
  ffffffff396fb751 libc.so.1`_lwp_start()
stack pointer for thread 3: ffffffff7d6fb1f1
[ ffffffff7d6fb1f1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7d6fb2b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7d6fb361 libc.so.1`cond_wait+0x10()
  ffffffff7d6fb411 libc.so.1`pthread_cond_wait+8()
  ffffffff7d6fb4c1 ethr_cond_wait+8()
  ffffffff7d6fb571 sys_msg_dispatcher_func+0x1c0()
  ffffffff7d6fb691 thr_wrapper+0x80()
  ffffffff7d6fb751 libc.so.1`_lwp_start()
stack pointer for thread 4: ffffffff7d21f201
[ ffffffff7d21f201 libc.so.1`__lwp_park+0x14() ]
  ffffffff7d21f2c1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7d21f371 libc.so.1`cond_wait+0x10()
  ffffffff7d21f421 libc.so.1`pthread_cond_wait+8()
  ffffffff7d21f4d1 wait__.constprop.1+0x3c8()
  ffffffff7d223591 async_main+0x2f0()
  ffffffff7d223691 thr_wrapper+0x80()
  ffffffff7d223751 libc.so.1`_lwp_start()
stack pointer for thread 5: ffffffff7c902eb1
[ ffffffff7c902eb1 cpool_insert+0xd0() ]
  ffffffff7c902f61 dealloc_block.part.17+0x1c0()
  ffffffff7c903021 erts_alcu_check_delayed_dealloc+0xe4()
  ffffffff7c9030e1 erts_alloc_scheduler_handle_delayed_dealloc+0x34()
  ffffffff7c903191 handle_aux_work+0xa50()
  ffffffff7c903251 erts_schedule+0x192c()
  ffffffff7c9033c1 process_main+0xc4()
  ffffffff7c9035b1 sched_thread_func+0x168()
  ffffffff7c903691 thr_wrapper+0x80()
  ffffffff7c903751 libc.so.1`_lwp_start()
stack pointer for thread 6: ffffffff7c703141
[ ffffffff7c703141 erts_find_export_entry+0x7c() ]
  ffffffff7c703301 prepare_loading_2+0x68()
  ffffffff7c7033c1 process_main+0xcf0()
  ffffffff7c7035b1 sched_thread_func+0x168()
  ffffffff7c703691 thr_wrapper+0x80()
  ffffffff7c703751 libc.so.1`_lwp_start()
stack pointer for thread 7: ffffffff7befed41
[ ffffffff7befed41 libc.so.1`__lwp_park+0x14() ]
  ffffffff7befee01 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7befeeb1 libc.so.1`cond_wait_common+0x28c()
  ffffffff7befef71 libc.so.1`__cond_timedwait+0x8c()
  ffffffff7beff031 libc.so.1`cond_timedwait+0x14()
  ffffffff7beff0e1 libc.so.1`pthread_cond_timedwait+0xc()
  ffffffff7beff191 wait__.constprop.1+0x308()
  ffffffff7bf03251 erts_schedule+0x1de0()
  ffffffff7bf033c1 process_main+0xc4()
  ffffffff7bf035b1 sched_thread_func+0x168()
  ffffffff7bf03691 thr_wrapper+0x80()
  ffffffff7bf03751 libc.so.1`_lwp_start()
stack pointer for thread 8: ffffffff7bd02eb1
[ ffffffff7bd02eb1 mbc_free+0x174() ]
  ffffffff7bd02f61 dealloc_block.part.17+0x1c0()
  ffffffff7bd03021 erts_alcu_check_delayed_dealloc+0xe4()
  ffffffff7bd030e1 erts_alloc_scheduler_handle_delayed_dealloc+0x34()
  ffffffff7bd03191 handle_aux_work+0xa50()
  ffffffff7bd03251 erts_schedule+0x192c()
  ffffffff7bd033c1 process_main+0xc4()
  ffffffff7bd035b1 sched_thread_func+0x168()
  ffffffff7bd03691 thr_wrapper+0x80()
  ffffffff7bd03751 libc.so.1`_lwp_start()
stack pointer for thread 9: ffffffff7bb4eff1
[ ffffffff7bb4eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7bb4f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7bb4f161 libc.so.1`cond_wait+0x10()
  ffffffff7bb4f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7bb4f2c1 wait__.constprop.1+0x3c8()
  ffffffff7bb53381 erts_schedule+0x1de0()
  ffffffff7bb534f1 erts_dirty_process_main+0x1dc()
  ffffffff7bb535b1 sched_dirty_cpu_thread_func+0xd0()
  ffffffff7bb53691 thr_wrapper+0x80()
  ffffffff7bb53751 libc.so.1`_lwp_start()
stack pointer for thread a: ffffffff7ba4eff1
[ ffffffff7ba4eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7ba4f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7ba4f161 libc.so.1`cond_wait+0x10()
  ffffffff7ba4f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7ba4f2c1 wait__.constprop.1+0x3c8()
  ffffffff7ba53381 erts_schedule+0x1de0()
  ffffffff7ba534f1 erts_dirty_process_main+0x1dc()
  ffffffff7ba535b1 sched_dirty_cpu_thread_func+0xd0()
  ffffffff7ba53691 thr_wrapper+0x80()
  ffffffff7ba53751 libc.so.1`_lwp_start()
stack pointer for thread b: ffffffff7b94eff1
[ ffffffff7b94eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b94f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b94f161 libc.so.1`cond_wait+0x10()
  ffffffff7b94f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b94f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b953381 erts_schedule+0x1de0()
  ffffffff7b9534f1 erts_dirty_process_main+0x1dc()
  ffffffff7b9535b1 sched_dirty_cpu_thread_func+0xd0()
  ffffffff7b953691 thr_wrapper+0x80()
  ffffffff7b953751 libc.so.1`_lwp_start()
stack pointer for thread c: ffffffff7b84eff1
[ ffffffff7b84eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b84f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b84f161 libc.so.1`cond_wait+0x10()
  ffffffff7b84f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b84f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b853381 erts_schedule+0x1de0()
  ffffffff7b8534f1 erts_dirty_process_main+0x1dc()
  ffffffff7b8535b1 sched_dirty_cpu_thread_func+0xd0()
  ffffffff7b853691 thr_wrapper+0x80()
  ffffffff7b853751 libc.so.1`_lwp_start()
stack pointer for thread d: ffffffff7b74eff1
[ ffffffff7b74eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b74f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b74f161 libc.so.1`cond_wait+0x10()
  ffffffff7b74f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b74f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b753381 erts_schedule+0x1de0()
  ffffffff7b7534f1 erts_dirty_process_main+0x78()
  ffffffff7b7535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b753691 thr_wrapper+0x80()
  ffffffff7b753751 libc.so.1`_lwp_start()
stack pointer for thread e: ffffffff7b64eff1
[ ffffffff7b64eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b64f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b64f161 libc.so.1`cond_wait+0x10()
  ffffffff7b64f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b64f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b653381 erts_schedule+0x1de0()
  ffffffff7b6534f1 erts_dirty_process_main+0x78()
  ffffffff7b6535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b653691 thr_wrapper+0x80()
  ffffffff7b653751 libc.so.1`_lwp_start()
stack pointer for thread 10: ffffffff7b44eff1
[ ffffffff7b44eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b44f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b44f161 libc.so.1`cond_wait+0x10()
  ffffffff7b44f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b44f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b453381 erts_schedule+0x1de0()
  ffffffff7b4534f1 erts_dirty_process_main+0x288()
  ffffffff7b4535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b453691 thr_wrapper+0x80()
  ffffffff7b453751 libc.so.1`_lwp_start()
stack pointer for thread 11: ffffffff7b34eff1
[ ffffffff7b34eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b34f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b34f161 libc.so.1`cond_wait+0x10()
  ffffffff7b34f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b34f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b353381 erts_schedule+0x1de0()
  ffffffff7b3534f1 erts_dirty_process_main+0x288()
  ffffffff7b3535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b353691 thr_wrapper+0x80()
  ffffffff7b353751 libc.so.1`_lwp_start()
stack pointer for thread 12: ffffffff7b24eff1
[ ffffffff7b24eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b24f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b24f161 libc.so.1`cond_wait+0x10()
  ffffffff7b24f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b24f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b253381 erts_schedule+0x1de0()
  ffffffff7b2534f1 erts_dirty_process_main+0x288()
  ffffffff7b2535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b253691 thr_wrapper+0x80()
  ffffffff7b253751 libc.so.1`_lwp_start()
stack pointer for thread 13: ffffffff7b1532d1
[ ffffffff7b1532d1 sched_spin_wait+0x17c() ]
  ffffffff7b153381 erts_schedule+0x19d0()
  ffffffff7b1534f1 erts_dirty_process_main+0x78()
  ffffffff7b1535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b153691 thr_wrapper+0x80()
  ffffffff7b153751 libc.so.1`_lwp_start()
stack pointer for thread 14: ffffffff7b04eff1
[ ffffffff7b04eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7b04f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7b04f161 libc.so.1`cond_wait+0x10()
  ffffffff7b04f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7b04f2c1 wait__.constprop.1+0x3c8()
  ffffffff7b053381 erts_schedule+0x1de0()
  ffffffff7b0534f1 erts_dirty_process_main+0x78()
  ffffffff7b0535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7b053691 thr_wrapper+0x80()
  ffffffff7b053751 libc.so.1`_lwp_start()
stack pointer for thread 15: ffffffff7af4eff1
[ ffffffff7af4eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7af4f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7af4f161 libc.so.1`cond_wait+0x10()
  ffffffff7af4f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7af4f2c1 wait__.constprop.1+0x3c8()
  ffffffff7af53381 erts_schedule+0x1de0()
  ffffffff7af534f1 erts_dirty_process_main+0x78()
  ffffffff7af535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7af53691 thr_wrapper+0x80()
  ffffffff7af53751 libc.so.1`_lwp_start()
stack pointer for thread 16: ffffffff7ae4eff1
[ ffffffff7ae4eff1 libc.so.1`__lwp_park+0x14() ]
  ffffffff7ae4f0b1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7ae4f161 libc.so.1`cond_wait+0x10()
  ffffffff7ae4f211 libc.so.1`pthread_cond_wait+8()
  ffffffff7ae4f2c1 wait__.constprop.1+0x3c8()
  ffffffff7ae53381 erts_schedule+0x1de0()
  ffffffff7ae534f1 erts_dirty_process_main+0x78()
  ffffffff7ae535b1 sched_dirty_io_thread_func+0xe4()
  ffffffff7ae53691 thr_wrapper+0x80()
  ffffffff7ae53751 libc.so.1`_lwp_start()
stack pointer for thread 17: ffffffff7ad4f221
[ ffffffff7ad4f221 libc.so.1`__lwp_park+0x14() ]
  ffffffff7ad4f2e1 libc.so.1`cond_wait_queue+0x4c()
  ffffffff7ad4f391 libc.so.1`cond_wait+0x10()
  ffffffff7ad4f441 libc.so.1`pthread_cond_wait+8()
  ffffffff7ad4f4f1 wait__.constprop.1+0x3c8()
  ffffffff7ad535b1 aux_thread+0x2ec()
  ffffffff7ad53691 thr_wrapper+0x80()
  ffffffff7ad53751 libc.so.1`_lwp_start()
stack pointer for thread 18: ffffffff7ac533d1
[ ffffffff7ac533d1 libc.so.1`ioctl+0xc() ]
  ffffffff7ac534c1 erts_check_io+0x54()
  ffffffff7ac535b1 poll_thread+0x208()
  ffffffff7ac53691 thr_wrapper+0x80()
  ffffffff7ac53751 libc.so.1`_lwp_start()

# object mappings

            BASE            LIMIT             SIZE NAME
       100000000        100340000           340000 temp_dir/erlang/erts-10.6/bin/beam.smp
ffffffff73c00000 ffffffff73c02000             2000 /lib/sparcv9/libsendfile.so.1
ffffffff3c700000 ffffffff3c706000             6000 /lib/sparcv9/libdlpi.so.1
ffffffff786fe000 ffffffff78700000             2000 /lib/sparcv9/libdl.so.1
ffffffff7f200000 ffffffff7f2b0000            b0000 /lib/sparcv9/libm.so.2
ffffffff7f000000 ffffffff7f010000            10000 /lib/sparcv9/libsocket.so.1
ffffffff7ee00000 ffffffff7ee70000            70000 /lib/sparcv9/libnsl.so.1
ffffffff75f00000 ffffffff75f02000             2000 /lib/sparcv9/libkstat.so.1
ffffffff7eafc000 ffffffff7eb00000             4000 /lib/sparcv9/libpthread.so.1
ffffffff786f8000 ffffffff786fa000             2000 /lib/sparcv9/librt.so.1
ffffffff7eb00000 ffffffff7ec80000           180000 /lib/sparcv9/libc.so.1
ffffffff7e000000 ffffffff7e400000           400000 /usr/lib/locale/en_US.UTF-8/sparcv9/en_US.UTF-8.so.3
ffffffff7de00000 ffffffff7de10000            10000 /usr/lib/locale/en_US.UTF-8/sparcv9/methods_unicode.so.3
ffffffff66900000 ffffffff66902000             2000 /usr/lib/sparcv9/libsctp.so.1
ffffffff7cc00000 ffffffff7cd30000           130000 /lib/sparcv9/libucrypto.so.1
ffffffff7c200000 ffffffff7c210000            10000 /lib/sparcv9/libcryptoutil.so.1
ffffffff7c000000 ffffffff7c030000            30000 /lib/sparcv9/libelf.so.1
ffffffff7ca00000 ffffffff7ca10000            10000 /lib/sparcv9/libz.so.1
ffffffff75d00000 ffffffff75d04000             4000 /lib/sparcv9/libmp.so.2
ffffffff7f500000 ffffffff7f540000            40000 /lib/sparcv9/ld.so.1

# machine information
Hostname: hostname01
Release: 5.11
Kernel architecture: sun4v
Application architecture: sparcv9
Kernel version: SunOS 5.11 sun4v 11.3
Platform: sun4v



argv[0]: /temp_dir/erlang/erts-10.6/bin/beam.smp
argv[1]: --
argv[2]: -root
argv[3]: /temp_dir/erlang
argv[4]: -progname
argv[5]: erl
argv[6]: --
argv[7]: -home
argv[8]: shared/global/mqbroker/mqhome
argv[9]: -epmd_port
argv[10]: 13778
argv[11]: --
argv[12]: -boot
argv[13]: no_dot_erlang
argv[14]: -sname
argv[15]: epmd-starter-25205088
argv[16]: -noshell
argv[17]: -noinput
argv[18]: -s
argv[19]: erlang
argv[20]: halt
argv[21]: --


# uname -a
SunOS hostname01 5.11 11.3 sun4v sparc sun4v


Thanks,

Pooja


Re: erlang (rabbitmq) generating core on Solaris SPARC

Mikael Pettersson-5
In reply to this post by Mikael Pettersson-5
On Thu, May 14, 2020 at 12:09 PM Mikael Pettersson <[hidden email]> wrote:

>
> On Thu, May 14, 2020 at 9:32 AM Pooja Desai <[hidden email]> wrote:
> >
> > Hi Mikael,
> >
> >
> > Please find the files you requested attached as erl_files.tar.gz (compressed, as I was facing an issue with the mail size limit).
> >
> > Normal build option is:
> >
> > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
> >
> > Following your suggestion, I updated it as below to generate the erl_alloc_util.i file:
> >
> > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
> >
> > Also, one thing I forgot to mention: we are using GCC version 4.9.2 for building on Solaris SPARC, as Erlang doesn't support Sun's native compiler.
>
> I've been able to reproduce the non-atomic code for those 64-bit loads
> in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
> gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
>
> So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
> your Erlang/OTP VM with that.
>
> /Mikael

I created a reduced test case from erl_alloc.i, and it turns out
Erlang/OTP was hit by
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70424, which affects
gcc-4.9 (all versions) and gcc-5.x (x < 4), on all strict-alignment
targets.

So the recommendation stands: upgrade your gcc.

> > Thanks & Regards,
> > Pooja
> >
> > On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <[hidden email]> wrote:
> >>
> >> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <[hidden email]> wrote:
> >> >
> >> > Hi,
> >> >
> >> >
> >> >
> >> > Thanks for the response, Mikael.
> >> >
> >> > As per your suggestion, I am trying to write similar code to determine whether there is some issue with the Solaris SPARC compiler.
> >> >
> >> >
> >> >
> >> > But I have some doubts,
> >> >
> >> > 1. If there is a problem with the compiler, then we should be able to see this crash everywhere else too; any idea why it is only reproduced here?
> >> >
> >> > 2. As I understand your explanation, it reads 64 bits by assembling two adjacent 32-bit fields. Will this really cause a problem in a multi-threaded program, considering that when the OS context-switches to another thread it saves the current thread's context (and hence registers) and restores it when the thread becomes active again?
> >> >
> >> >
> >>
> >> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
> >> any concurrent store into that location, meaning the read may end up
> >> observing a result composed of 32 bit from the old value and 32 bit
> >> from the newly stored value, whereas the code expects to see either
> >> the old or the new, but never this mixture.  This can happen also on a
> >> single-threaded CPU with preemptive multitasking.
> >>
> >> To move forward on the issue, I think you need to recreate the
> >> pre-processed source for erl_alloc_util.c.  To do that:
> >> 1. Compile Erlang/OTP as usual, starting from a pristine source
> >> directory (no left-overs from a previous build, best is to start fresh
> >> somewhere), but pass "V=1" to make.  Save the output from "make" in a
> >> file.
> >> 2. Note the step where it compiles erl_alloc_util.c.
> >> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
> >> erl_alloc_util.o" with "-o erl_alloc_util.i".
> >> 4. Please send this ".i" file, together with the exact build steps and
> >> configuration options you used, and
> >> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
> >> to me.
> >>
> >> My theory is that Erlang/OTP selects the wrong low-level primitives
> >> for this platform.
> >>
> >>
> >> >
> >> >
> >> > Thanks & Regards,
> >> >
> >> > Pooja
> >> >
> >> >
> >> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <[hidden email]> wrote:
> >> >>
> >> >> Hello Pooja,
> >> >>
> >> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <[hidden email]> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Facing an Erlang core issue on a Solaris SPARC setup while running RabbitMQ
> >> >>
> >> >> This looks like a 64-bit build, but the code doesn't look similar to
> >> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
> >> >>
> >> >>
> >> >> > (dbx) where
> >> >> >
> >> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
> >> >> >
> >> >> >   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
> >> >> >
> >> >> >   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
> >> >> >
> >> >> >   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
> >> >> >
> >> >> >   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
> >> >> >
> >> >> > at 0x1000622c0
> >> >> >
> >> >> >   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
> >> >> >
> >> >> >   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
> >> >> >
> >> >> >   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
> >> >> >
> >> >> >   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
> >> >> >
> >> >> >   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
> >> >> >
> >> >> >
> >> >> >
> >> >> > This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core only twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment things are working fine.
> >> >> >
> >> >> > In both instances when I faced this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for RabbitMQ. This performs two operations:
> >> >> >
> >> >> > First, ping RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then start rabbitmq-server in detached mode.
> >> >> >
> >> >> > The core is generated while starting this daemon.
> >> >> >
> >> >> >
> >> >> > I checked the code around abandon_carrier ("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c") but nothing changed in that area recently, so I am in a really clueless situation.
> >> >> >
> >> >> > Please let me know if anyone has faced a similar issue in the past or has any idea about this. Using OTP version 22.2 and RabbitMQ version 3.7.23.
> >> >> >
> >> >> > Let me know if any further information is required; pasting the full core dump information below:
> >> >> >
> >> >> > debugging core file of beam.smp (64-bit) from hostname01
> >> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
> >> >> > initial argv:
> >> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
> >> >> > threading model: native threads
> >> >> > status: process terminated by SIGSEGV (Segmentation Fault), addr=
> >> >> > ffffffff004631b0
> >> >>
> >> >> Ok, this tells us the address was unmapped.  (It's not an alignment
> >> >> fault, another common issue on SPARC.)
> >> >>
> >> >>
> >> >> >
> >> >> > C++ symbol demangling enabled
> >> >> >
> >> >> > # stack
> >> >> >
> >> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
> >> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
> >> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
> >> >> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
> >> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
> >> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
> >> >> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
> >> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
> >> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
> >> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
> >> >> >
> >> >> > #############################################################################
> >> >> >
> >> >> > # registers
> >> >> >
> >> >> > %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
> >> >> > %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
> >> >> > %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
> >> >> > %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
> >> >> > %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
> >> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
> >> >>
> >> >> This is interesting.  Notice how the low 32-bits 004631a0 show up in
> >> >> three variations:
> >> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
> >> >> firstfit_carrier_pool global variable)
> >> >> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
> >> >> with all-bits-one)
> >> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
> >> >>
> >> >> > %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
> >> >> > %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
> >> >> > %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
> >> >> > %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
> >> >> > %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
> >> >> > %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
> >> >> > %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
> >> >> > %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
> >> >> > %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
> >> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
> >> >> >
> >> >> >  %ccr = 0x44 xcc=nZvc icc=nZvc
> >> >> >    %y = 0x0000000000000000
> >> >> >   %pc = 0x000000010006db14 cpool_insert+0xd0
> >> >> >  %npc = 0x000000010006db18 cpool_insert+0xd4
> >> >> >   %sp = 0xffffffff7c902eb1
> >> >> >   %fp = 0xffffffff7c902f61
> >> >> >
> >> >> >  %asi = 0x82
> >> >> > %fprs = 0x00
> >> >> >
> >> >> > # disassembly around pc
> >> >> >
> >> >> > cpool_insert+0xa8:              mov       %g1, %g2
> >> >> > cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
> >> >> > cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
> >> >> > cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
> >> >> > cpool_insert+0xb8:              and       %g1, -0x4, %g4
> >> >>
> >> >> > cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
> >> >> > cpool_insert+0xc0:              and       %g2, 0x3, %g3
> >> >> > cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
> >> >> > cpool_insert+0xc8:              mov       %g2, %g1
> >> >> > cpool_insert+0xcc:              and       %g1, -0x4, %g4
> >> >> > cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
> >> >>
> >> >> This is the faulting instruction. We're in the /* Find a predecessor
> >> >> to be, and set mod marker on its next ptr */ loop.
> >> >>
> >> >> > cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
> >> >> > cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
> >> >> > cpool_insert+0xdc:              cmp       %g5, %g4
> >> >> > cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
> >> >> > cpool_insert+0xe4:              or        %g2, %g1, %g2
> >> >>
> >> >> The above reads a 64-bit "->next" pointer by assembling two adjacent
> >> >> 32-bit fields.  Weird, but arithmetically Ok.
> >> >>
> >> >> Two things strike me:
> >> >> 1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
> >> >> load another 32 bits, combine", which isn't correct in a multithreaded
> >> >> program.  The error could be in the compiler, or in the source code.
> >> >> 2. In the register dump it was obvious that the high bits of an
> >> >> address had been clobbered.
> >> >>
> >> >> My suspicion is that either Sun's compiler is buggy, or Erlang is
> >> >> selecting non thread-safe code in this case.
> >> >>
> >> >> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
> >> >> those 64-bit loads, as expected.
> >> >>
> >> >> /Mikael

Re: erlang (rabbitmq) generating core on Solaris SPARC

Pooja Desai
Thanks Mikael,

As per your suggestion, I am rebuilding Erlang with a newer gcc version. Thanks for helping with this.

Thanks & Regards,
Pooja

On Fri, May 15, 2020 at 3:20 AM Mikael Pettersson <[hidden email]> wrote:
On Thu, May 14, 2020 at 12:09 PM Mikael Pettersson <[hidden email]> wrote:
>
> On Thu, May 14, 2020 at 9:32 AM Pooja Desai <[hidden email]> wrote:
> >
> > Hi Mikael,
> >
> >
> > Please find the files you requested attached as erl_files.tar.gz (compressed, as I was facing an issue with the mail size).
> >
> > Normal build option is:
> >
> > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
> >
> > after your suggestion I updated it as below to generate erl_alloc_util file:
> >
> > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
> >
> > Also, one thing I forgot to mention: we are using gcc version 4.9.2 for building on Solaris SPARC, as Erlang doesn't support Sun's native compiler.
>
> I've been able to reproduce the non-atomic code for those 64-bit loads
> in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
> gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
>
> So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
> your Erlang/OTP VM with that.
>
> /Mikael

I created a reduced test case from erl_alloc.i, and it turns out
Erlang/OTP was hit by
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70424, which affects
gcc-4.9 (all versions) and gcc-5.x (x < 4), on all strict-alignment
targets.

So the recommendation stands: upgrade your gcc.
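As a convenience, the affected version range (all gcc 4.9.x, and 5.x with x < 4, per the PR above) can be checked with a small shell sketch before rebuilding. The function name and version parsing are my own, illustrative only:

```shell
# Flag gcc versions hit by GCC PR 70424: all 4.9.x, and 5.x with x < 4.
# (Hypothetical helper; parses "major.minor[.patch]" version strings.)
affected_by_pr70424() {
    ver=$1
    major=${ver%%.*}       # text before the first dot
    rest=${ver#*.}         # text after the first dot
    minor=${rest%%.*}      # text before the next dot
    { [ "$major" -eq 4 ] && [ "$minor" -eq 9 ]; } ||
    { [ "$major" -eq 5 ] && [ "$minor" -lt 4 ]; }
}

# In a real check you would pass "$(gcc -dumpversion)" instead.
for v in 4.9.2 5.3.0 5.5.0 9.3.0; do
    if affected_by_pr70424 "$v"; then
        echo "$v: affected by PR 70424"
    else
        echo "$v: ok"
    fi
done
```

This reports 4.9.2 and 5.3.0 as affected, and 5.5.0 and 9.3.0 as ok, matching the versions tested above.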

> > Thanks & Regards,
> > Pooja
> >
> > On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <[hidden email]> wrote:
> >>
> >> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <[hidden email]> wrote:
> >> >
> >> > Hi,
> >> >
> >> >
> >> >
> >> > Thanks for the response, Mikael.
> >> >
> >> > As per your suggestion, I am trying to write similar code to determine whether there is an issue with the Solaris SPARC compiler.
> >> >
> >> >
> >> >
> >> > But I have some doubts,
> >> >
> >> > 1.     If there is a problem with the compiler then we should be able to see this crash everywhere else too; any idea why it's only reproduced here?
> >> >
> >> > 2.     As I understand your explanation, it reads 64 bits by assembling two adjacent 32-bit fields. Will that really cause a problem in a multi-threaded program? After all, when context switching to another thread the OS will save the current context of the thread (and hence its registers) and restore it when the thread is active again.
> >> >
> >> >
> >>
> >> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
> >> any concurrent store into that location, meaning the read may end up
> >> observing a result composed of 32 bits from the old value and 32 bits
> >> from the newly stored value, whereas the code expects to see either
> >> the old or the new, but never this mixture.  This can happen also on a
> >> single-threaded CPU with preemptive multitasking.
> >>
> >> To move forward on the issue, I think you need to recreate the
> >> pre-processed source for erl_alloc_util.c.  To do that:
> >> 1. Compile Erlang/OTP as usual, starting from a pristine source
> >> directory (no left-overs from a previous build, best is to start fresh
> >> somewhere), but pass "V=1" to make.  Save the output from "make" in a
> >> file.
> >> 2. Note the step where it compiles erl_alloc_util.c.
> >> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
> >> erl_alloc_util.o" with "-o erl_alloc_util.i".
> >> 4. Please send this ".i" file, together with the exact build steps and
> >> configuration options you used, and
> >> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
> >> to me.
> >>
> >> My theory is that Erlang/OTP selects the wrong low-level primitives
> >> for this platform.
> >>
> >>
> >> >
> >> >
> >> > Thanks & Regards,
> >> >
> >> > Pooja
> >> >
> >> >
> >> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <[hidden email]> wrote:
> >> >>
> >> >> Hello Pooja,
> >> >>
> >> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <[hidden email]> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Facing an Erlang core issue on a Solaris SPARC setup while running RabbitMQ
> >> >>
> >> >> This looks like a 64-bit build, but the code doesn't look similar to
> >> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
> >> >>
> >> >>
> >> >> > (dbx) where
> >> >> >
> >> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
> >> >> >
> >> >> >   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
> >> >> >
> >> >> >   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
> >> >> >
> >> >> >   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
> >> >> >
> >> >> >   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
> >> >> >
> >> >> > at 0x1000622c0
> >> >> >
> >> >> >   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
> >> >> >
> >> >> >   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
> >> >> >
> >> >> >   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
> >> >> >
> >> >> >   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
> >> >> >
> >> >> >   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
> >> >> >
> >> >> >
> >> >> >
> >> >> > This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core only twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment things are working fine.
> >> >> >
> >> >> > In both instances when I faced this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for RabbitMQ. This performs two operations:
> >> >> >
> >> >> > First, ping RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then start rabbitmq-server in detached mode.
> >> >> >
> >> >> > The core is generated while starting this daemon.
> >> >> >
> >> >> >
> >> >> > I checked the code around abandon_carrier ("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c") but nothing changed in that area recently, so I am in a really clueless situation.
> >> >> >
> >> >> > Please let me know if anyone has faced a similar issue in the past or has any idea about this. Using OTP version 22.2 and RabbitMQ version 3.7.23.
> >> >> >
> >> >> > Let me know if any further information is required; pasting the full core dump information below:
> >> >> >
> >> >> > debugging core file of beam.smp (64-bit) from hostname01
> >> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
> >> >> > initial argv:
> >> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
> >> >> > threading model: native threads
> >> >> > status: process terminated by SIGSEGV (Segmentation Fault), addr=
> >> >> > ffffffff004631b0
> >> >>
> >> >> Ok, this tells us the address was unmapped.  (It's not an alignment
> >> >> fault, another common issue on SPARC.)
> >> >>
> >> >>
> >> >> >
> >> >> > C++ symbol demangling enabled
> >> >> >
> >> >> > # stack
> >> >> >
> >> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
> >> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
> >> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
> >> >> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
> >> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
> >> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
> >> >> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
> >> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
> >> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
> >> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
> >> >> >
> >> >> > #############################################################################
> >> >> >
> >> >> > # registers
> >> >> >
> >> >> > %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
> >> >> > %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
> >> >> > %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
> >> >> > %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
> >> >> > %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
> >> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
> >> >>
> >> >> This is interesting.  Notice how the low 32-bits 004631a0 show up in
> >> >> three variations:
> >> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
> >> >> firstfit_carrier_pool global variable)
> >> >> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
> >> >> with all-bits-one)
> >> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
> >> >>
> >> >> > %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
> >> >> > %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
> >> >> > %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
> >> >> > %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
> >> >> > %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
> >> >> > %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
> >> >> > %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
> >> >> > %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
> >> >> > %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
> >> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
> >> >> >
> >> >> >  %ccr = 0x44 xcc=nZvc icc=nZvc
> >> >> >    %y = 0x0000000000000000
> >> >> >   %pc = 0x000000010006db14 cpool_insert+0xd0
> >> >> >  %npc = 0x000000010006db18 cpool_insert+0xd4
> >> >> >   %sp = 0xffffffff7c902eb1
> >> >> >   %fp = 0xffffffff7c902f61
> >> >> >
> >> >> >  %asi = 0x82
> >> >> > %fprs = 0x00
> >> >> >
> >> >> > # disassembly around pc
> >> >> >
> >> >> > cpool_insert+0xa8:              mov       %g1, %g2
> >> >> > cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
> >> >> > cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
> >> >> > cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
> >> >> > cpool_insert+0xb8:              and       %g1, -0x4, %g4
> >> >>
> >> >> > cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
> >> >> > cpool_insert+0xc0:              and       %g2, 0x3, %g3
> >> >> > cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
> >> >> > cpool_insert+0xc8:              mov       %g2, %g1
> >> >> > cpool_insert+0xcc:              and       %g1, -0x4, %g4
> >> >> > cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
> >> >>
> >> >> This is the faulting instruction. We're in the /* Find a predecessor
> >> >> to be, and set mod marker on its next ptr */ loop.
> >> >>
> >> >> > cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
> >> >> > cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
> >> >> > cpool_insert+0xdc:              cmp       %g5, %g4
> >> >> > cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
> >> >> > cpool_insert+0xe4:              or        %g2, %g1, %g2
> >> >>
> >> >> The above reads a 64-bit "->next" pointer by assembling two adjacent
> >> >> 32-bit fields.  Weird, but arithmetically Ok.
> >> >>
> >> >> Two things strike me:
> >> >> 1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
> >> >> load another 32 bits, combine", which isn't correct in a multithreaded
> >> >> program.  The error could be in the compiler, or in the source code.
> >> >> 2. In the register dump it was obvious that the high bits of an
> >> >> address had been clobbered.
> >> >>
> >> >> My suspicion is that either Sun's compiler is buggy, or Erlang is
> >> >> selecting non thread-safe code in this case.
> >> >>
> >> >> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
> >> >> those 64-bit loads, as expected.
> >> >>
> >> >> /Mikael

Re: erlang (rabbitmq) generating core on Solaris SPARC

Pooja Desai
Hi Mikael,

The gcc bug mentioned above is not specific to any platform, but the problematic disassembly is only generated for Solaris SPARC. Any idea why only the Solaris SPARC Erlang build is affected by this?
Actually, to minimise the impact on testing/sock we are thinking about rebuilding Erlang only on Solaris SPARC for now, as the issue is only seen on that platform. So, checking your expert opinion: do you see any problem with this approach?

Thanks & Regards,
Pooja

On Fri, May 15, 2020 at 1:51 PM Pooja Desai <[hidden email]> wrote:
Thanks Mikael,

As per your suggestion, I am rebuilding Erlang with a newer gcc version. Thanks for helping with this.

Thanks & Regards,
Pooja

On Fri, May 15, 2020 at 3:20 AM Mikael Pettersson <[hidden email]> wrote:
On Thu, May 14, 2020 at 12:09 PM Mikael Pettersson <[hidden email]> wrote:
>
> On Thu, May 14, 2020 at 9:32 AM Pooja Desai <[hidden email]> wrote:
> >
> > Hi Mikael,
> >
> >
> > Please find the files you requested attached as erl_files.tar.gz (compressed, as I was facing an issue with the mail size).
> >
> > Normal build option is:
> >
> > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
> >
> > after your suggestion I updated it as below to generate erl_alloc_util file:
> >
> > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
> >
> > Also, one thing I forgot to mention: we are using gcc version 4.9.2 for building on Solaris SPARC, as Erlang doesn't support Sun's native compiler.
>
> I've been able to reproduce the non-atomic code for those 64-bit loads
> in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
> gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
>
> So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
> your Erlang/OTP VM with that.
>
> /Mikael

I created a reduced test case from erl_alloc.i, and it turns out
Erlang/OTP was hit by
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70424, which affects
gcc-4.9 (all versions) and gcc-5.x (x < 4), on all strict-alignment
targets.

So the recommendation stands: upgrade your gcc.

> > Thanks & Regards,
> > Pooja
> >
> > On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <[hidden email]> wrote:
> >>
> >> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <[hidden email]> wrote:
> >> >
> >> > Hi,
> >> >
> >> >
> >> >
> >> > Thanks for the response, Mikael.
> >> >
> >> > As per your suggestion, I am trying to write similar code to determine whether there is an issue with the Solaris SPARC compiler.
> >> >
> >> >
> >> >
> >> > But I have some doubts,
> >> >
> >> > 1.     If there is a problem with the compiler then we should be able to see this crash everywhere else too; any idea why it's only reproduced here?
> >> >
> >> > 2.     As I understand your explanation, it reads 64 bits by assembling two adjacent 32-bit fields. Will that really cause a problem in a multi-threaded program? After all, when context switching to another thread the OS will save the current context of the thread (and hence its registers) and restore it when the thread is active again.
> >> >
> >> >
> >>
> >> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
> >> any concurrent store into that location, meaning the read may end up
> >> observing a result composed of 32 bits from the old value and 32 bits
> >> from the newly stored value, whereas the code expects to see either
> >> the old or the new, but never this mixture.  This can happen also on a
> >> single-threaded CPU with preemptive multitasking.
> >>
> >> To move forward on the issue, I think you need to recreate the
> >> pre-processed source for erl_alloc_util.c.  To do that:
> >> 1. Compile Erlang/OTP as usual, starting from a pristine source
> >> directory (no left-overs from a previous build, best is to start fresh
> >> somewhere), but pass "V=1" to make.  Save the output from "make" in a
> >> file.
> >> 2. Note the step where it compiles erl_alloc_util.c.
> >> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
> >> erl_alloc_util.o" with "-o erl_alloc_util.i".
> >> 4. Please send this ".i" file, together with the exact build steps and
> >> configuration options you used, and
> >> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
> >> to me.
> >>
> >> My theory is that Erlang/OTP selects the wrong low-level primitives
> >> for this platform.
> >>
> >>
> >> >
> >> >
> >> > Thanks & Regards,
> >> >
> >> > Pooja
> >> >
> >> >
> >> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <[hidden email]> wrote:
> >> >>
> >> >> Hello Pooja,
> >> >>
> >> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <[hidden email]> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Facing an Erlang core issue on a Solaris SPARC setup while running RabbitMQ
> >> >>
> >> >> This looks like a 64-bit build, but the code doesn't look similar to
> >> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
> >> >>
> >> >>
> >> >> > (dbx) where
> >> >> >
> >> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
> >> >> >
> >> >> >   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
> >> >> >
> >> >> >   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
> >> >> >
> >> >> >   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
> >> >> >
> >> >> >   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
> >> >> >
> >> >> > at 0x1000622c0
> >> >> >
> >> >> >   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
> >> >> >
> >> >> >   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
> >> >> >
> >> >> >   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
> >> >> >
> >> >> >   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
> >> >> >
> >> >> >   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
> >> >> >
> >> >> >
> >> >> >
> >> >> > This issue is extremely intermittent, so I am not able to reproduce it with a debug build. On our test setup I have seen this core only twice, and only on the Solaris SPARC server; on other servers (RHEL, SUSE Linux, Solaris x86, Windows, etc.) with a similar test environment things are working fine.
> >> >> >
> >> >> > In both instances when I faced this issue we were restarting the RabbitMQ server, i.e. stopping RabbitMQ and epmd and then running the startup script for RabbitMQ. This performs two operations:
> >> >> >
> >> >> > First, ping RabbitMQ using "rabbitmqctl ping" to confirm RabbitMQ is not already running (I guess in the background this will also start epmd), and then start rabbitmq-server in detached mode.
> >> >> >
> >> >> > The core is generated while starting this daemon.
> >> >> >
> >> >> >
> >> >> > I checked the code around abandon_carrier ("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c") but nothing changed in that area recently, so I am in a really clueless situation.
> >> >> >
> >> >> > Please let me know if anyone has faced a similar issue in the past or has any idea about this. We are using OTP version 22.2 and RabbitMQ version 3.7.23.
> >> >> >
> >> >> > Let me know if any further information is required; I am pasting the full core dump information below:
> >> >> >
> >> >> > debugging core file of beam.smp (64-bit) from hostname01
> >> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
> >> >> > initial argv:
> >> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
> >> >> > threading model: native threads
> >> >> > status: process terminated by SIGSEGV (Segmentation Fault), addr=
> >> >> > ffffffff004631b0
> >> >>
> >> >> Ok, this tells us the address was unmapped.  (It's not an alignment
> >> >> fault, another common issue on SPARC.)
> >> >>
> >> >>
> >> >> >
> >> >> > C++ symbol demangling enabled
> >> >> >
> >> >> > # stack
> >> >> >
> >> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
> >> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
> >> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
> >> >> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
> >> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
> >> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
> >> >> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
> >> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
> >> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
> >> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
> >> >> >
> >> >> > #############################################################################
> >> >> >
> >> >> > # registers
> >> >> >
> >> >> > %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
> >> >> > %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
> >> >> > %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
> >> >> > %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
> >> >> > %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
> >> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
> >> >>
> >> >> This is interesting.  Notice how the low 32-bits 004631a0 show up in
> >> >> three variations:
> >> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
> >> >> firstfit_carrier_pool global variable)
> >> >> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
> >> >> with all-bits-one)
> >> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
> >> >>
> >> >> > %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
> >> >> > %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
> >> >> > %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
> >> >> > %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
> >> >> > %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
> >> >> > %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
> >> >> > %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
> >> >> > %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
> >> >> > %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
> >> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
> >> >> >
> >> >> >  %ccr = 0x44 xcc=nZvc icc=nZvc
> >> >> >    %y = 0x0000000000000000
> >> >> >   %pc = 0x000000010006db14 cpool_insert+0xd0
> >> >> >  %npc = 0x000000010006db18 cpool_insert+0xd4
> >> >> >   %sp = 0xffffffff7c902eb1
> >> >> >   %fp = 0xffffffff7c902f61
> >> >> >
> >> >> >  %asi = 0x82
> >> >> > %fprs = 0x00
> >> >> >
> >> >> > # disassembly around pc
> >> >> >
> >> >> > cpool_insert+0xa8:              mov       %g1, %g2
> >> >> > cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
> >> >> > cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
> >> >> > cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
> >> >> > cpool_insert+0xb8:              and       %g1, -0x4, %g4
> >> >>
> >> >> > cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
> >> >> > cpool_insert+0xc0:              and       %g2, 0x3, %g3
> >> >> > cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
> >> >> > cpool_insert+0xc8:              mov       %g2, %g1
> >> >> > cpool_insert+0xcc:              and       %g1, -0x4, %g4
> >> >> > cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
> >> >>
> >> >> This is the faulting instruction. We're in the /* Find a predecessor
> >> >> to be, and set mod marker on its next ptr */ loop.
> >> >>
> >> >> > cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
> >> >> > cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
> >> >> > cpool_insert+0xdc:              cmp       %g5, %g4
> >> >> > cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
> >> >> > cpool_insert+0xe4:              or        %g2, %g1, %g2
> >> >>
> >> >> The above reads a 64-bit "->next" pointer by assembling two adjacent
> >> >> 32-bit fields.  Weird, but arithmetically Ok.
> >> >>
> >> >> Two things strike me:
> >> >> 1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
> >> >> load another 32 bits, combine", which isn't correct in a multithreaded
> >> >> program.  The error could be in the compiler, or in the source code.
> >> >> 2. In the register dump it was obvious that the high bits of an
> >> >> address had been clobbered.
> >> >>
> >> >> My suspicion is that either Sun's compiler is buggy, or Erlang is
> >> >> selecting non thread-safe code in this case.
> >> >>
> >> >> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
> >> >> those 64-bit loads, as expected.
> >> >>
> >> >> /Mikael

Re: erlang (rabbitmq) generating core on Solaris SPARC

Mikael Pettersson-5
On Tue, May 19, 2020 at 6:24 PM Pooja Desai <[hidden email]> wrote:
>
> Hi Mikael,
>
> The gcc bug mentioned above is not specific to any platform, but the problematic disassembly is only generated for Solaris SPARC. Any idea why only the Solaris SPARC Erlang is affected by this?

As I wrote, the bug affects all strict-alignment targets, and SPARC is
one of those.  Most older RISC designs are strict-alignment.
x86 is not strict-alignment for general purpose instructions, but some
of its vector instructions are.

/Mikael

> Actually, to minimise the impact on testing/sock, we are thinking of rebuilding Erlang only on Solaris SPARC for now, as the issue is only seen on the Solaris platform. So, to check your expert opinion: do you see any problem with this approach?
>
> Thanks & Regards,
> Pooja
>
> On Fri, May 15, 2020 at 1:51 PM Pooja Desai <[hidden email]> wrote:
>>
>> Thanks Mikael,
>>
>> As per your suggestion I am rebuilding erlang with newer gcc version. Thanks for helping with this.
>>
>> Thanks & Regards,
>> Pooja
>>
>> On Fri, May 15, 2020 at 3:20 AM Mikael Pettersson <[hidden email]> wrote:
>>>
>>> On Thu, May 14, 2020 at 12:09 PM Mikael Pettersson <[hidden email]> wrote:
>>> >
>>> > On Thu, May 14, 2020 at 9:32 AM Pooja Desai <[hidden email]> wrote:
>>> > >
>>> > > Hi Mikael,
>>> > >
>>> > >
>>> > > Please find the files you requested attached as erl_files.tar.gz (compressed, as I was facing an issue with the mail size).
>>> > >
>>> > > The normal build command is:
>>> > >
>>> > > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
>>> > >
>>> > > After your suggestion I updated it as below to generate the erl_alloc_util.i file:
>>> > >
>>> > > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
>>> > >
>>> > > Also, one thing I forgot to mention: we are using gcc version 4.9.2 for building on Solaris SPARC, as Erlang doesn't support Sun's native compiler.
>>> >
>>> > I've been able to reproduce the non-atomic code for those 64-bit loads
>>> > in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
>>> > gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
>>> >
>>> > So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
>>> > your Erlang/OTP VM with that.
>>> >
>>> > /Mikael
>>>
>>> I created a reduced test case from erl_alloc.i, and it turns out
>>> Erlang/OTP was hit by
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70424, which affects
>>> gcc-4.9 (all versions) and gcc-5.x (x < 4), on all strict-alignment
>>> targets.
>>>
>>> So the recommendation stands: upgrade your gcc.
>>>
>>> > > Thanks & Regards,
>>> > > Pooja
>>> > >
>>> > > On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <[hidden email]> wrote:
>>> > >>
>>> > >> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <[hidden email]> wrote:
>>> > >> >
>>> > >> > Hi,
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > Thanks for response Mikael
>>> > >> >
>>> > >> > As per your suggestion I am trying to write similar code to conclude if there is some issue with Solaris SPARC compiler.
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > But I have some doubts,
>>> > >> >
>>> > >> > 1.     If there is a problem with the compiler then we should be able to see this crash everywhere else as well; any idea why it is only reproduced here?
>>> > >> >
>>> > >> > 2.     As I understand your explanation, it reads 64 bits by assembling two adjacent 32-bit fields. Will it really cause a problem in a multi-threaded program? Considering that while context switching to another thread, the OS will save the current context of the thread (and hence the registers) and will restore it when the thread is active again.
>>> > >> >
>>> > >> >
>>> > >>
>>> > >> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
>>> > >> any concurrent store into that location, meaning the read may end up
>>> > >> observing a result composed of 32 bits from the old value and 32 bits
>>> > >> from the newly stored value, whereas the code expects to see either
>>> > >> the old or the new, but never this mixture.  This can happen even on a
>>> > >> single-threaded CPU with preemptive multitasking.
>>> > >>
>>> > >> To move forward on the issue, I think you need to recreate the
>>> > >> pre-processed source for erl_alloc_util.c.  To do that:
>>> > >> 1. Compile Erlang/OTP as usual, starting from a pristine source
>>> > >> directory (no left-overs from a previous build, best is to start fresh
>>> > >> somewhere), but pass "V=1" to make.  Save the output from "make" in a
>>> > >> file.
>>> > >> 2. Note the step where it compiles erl_alloc_util.c.
>>> > >> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
>>> > >> erl_alloc_util.o" with "-o erl_alloc_util.i".
>>> > >> 4. Please send this ".i" file, together with the exact build steps and
>>> > >> configuration options you used, and
>>> > >> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
>>> > >> to me.
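The substitution in step 3 can be sketched in shell; the compile command here is a shortened stand-in for the real one captured in the V=1 log:

```shell
# Hypothetical compile command standing in for the one from the V=1 build log:
cmd='gcc -O3 -c beam/erl_alloc_util.c -o obj/erl_alloc_util.o'
# Swap -c for -E (preprocess only) and the .o output for a .i output:
newcmd=$(printf '%s' "$cmd" | sed -e 's/ -c / -E /' -e 's/\.o$/.i/')
echo "$newcmd"
# -> gcc -O3 -E beam/erl_alloc_util.c -o obj/erl_alloc_util.i
```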
>>> > >>
>>> > >> My theory is that Erlang/OTP selects the wrong low-level primitives
>>> > >> for this platform.
>>> > >>
>>> > >>
>>> > >> >
>>> > >> >
>>> > >> > Thanks & Regards,
>>> > >> >
>>> > >> > Pooja
>>> > >> >
>>> > >> >
>>> > >> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <[hidden email]> wrote:
>>> > >> >>
>>> > >> >> Hello Pooja,
>>> > >> >>
>>> > >> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <[hidden email]> wrote:
>>> > >> >> >
>>> > >> >> > Hi,
>>> > >> >> >
>>> > >> >> > Facing erlang core issue on solaris SPARC setup while running RabbitMQ
>>> > >> >>
>>> > >> >> This looks like a 64-bit build, but the code doesn't look similar to
>>> > >> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?

Re: erlang (rabbitmq) generating core on Solaris SPARC

Pooja Desai
Ok, thanks for explaining and helping with this issue.

Thanks & Regards,
Pooja

On Wed, May 20, 2020, 4:27 PM Mikael Pettersson <[hidden email]> wrote:
On Tue, May 19, 2020 at 6:24 PM Pooja Desai <[hidden email]> wrote:
>
> Hi Mikael,
>
> gcc bug mention above is not specific to any platform but problematic disassembly is only generated for solaris sparc. Any idea why only solaris sparc erlang is affected by this?

As I wrote, the bug affects all strict-alignment targets, and SPARC is
one of those.  Most older RISC designs are strict-alignment.
x86 is not strict-alignment for general purpose instructions, but some
of its vector instructions are.

/Mikael

> Actually to minimise impact on testing/sock we are thinking about only rebuilding erlang on solaris sparc for now as issue is only faced on solaris platform. So checking your expert opinion, do you see any problem with this approach?
>
> Thanks & Regards,
> Pooja
>
> On Fri, May 15, 2020 at 1:51 PM Pooja Desai <[hidden email]> wrote:
>>
>> Thanks Mikael,
>>
>> As per your suggestion I am rebuilding erlang with newer gcc version. Thanks for helping with this.
>>
>> Thanks & Regards,
>> Pooja
>>
>> On Fri, May 15, 2020 at 3:20 AM Mikael Pettersson <[hidden email]> wrote:
>>>
>>> On Thu, May 14, 2020 at 12:09 PM Mikael Pettersson <[hidden email]> wrote:
>>> >
>>> > On Thu, May 14, 2020 at 9:32 AM Pooja Desai <[hidden email]> wrote:
>>> > >
>>> > > Hi Mikael,
>>> > >
>>> > >
>>> > > Please find flies you requested in attachment as erl_files.tar.gz (compressed as facing issue with mail size)
>>> > >
>>> > > Normal build option is:
>>> > >
>>> > > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -c beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.o
>>> > >
>>> > > after your suggestion I updated it as below to generate erl_alloc_util file:
>>> > >
>>> > > # gcc  -Werror=undef -Werror=implicit -Werror=return-type  -m64 -g  -O3 -fomit-frame-pointer -Ierlang/src/solaris/otp/erts/sparc-sun-solaris2.10  -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -fno-tree-copyrename   -DHAVE_CONFIG_H -Wall -Wstrict-prototypes -Wmissing-prototypes -Wdeclaration-after-statement -DUSE_THREADS -D_THREAD_SAFE -D_REENTRANT -DPOSIX_THREADS -D_POSIX_PTHREAD_SEMANTICS   -Isparc-sun-solaris2.10/opt/smp -Ibeam -Isys/unix -Isys/common -Isparc-sun-solaris2.10 -Izlib  -Ipcre -Ihipe -I../include -I../include/sparc-sun-solaris2.10 -I../include/internal -I../include/internal/sparc-sun-solaris2.10 -E beam/erl_alloc_util.c -o obj/sparc-sun-solaris2.10/opt/smp/erl_alloc_util.i
>>> > >
>>> > > Also one thing I missed to mention, we are using gcc version 4.9.2 (GCC) for building on solaris SPARC as erlang doesn't support Sun's native compiler.
>>> >
>>> > I've been able to reproduce the non-atomic code for those 64-bit loads
>>> > in cpool_insert() using gcc-4.9 cross compilers to sparc64-linux, but
>>> > gcc-5.5/6.5/7.5/8.4/9.3 all emit correct code as far as I can tell.
>>> >
>>> > So the solution is to upgrade your gcc (I suggest 9.3.0) and rebuild
>>> > your Erlang/OTP VM with that.
>>> >
>>> > /Mikael
>>>
>>> I created a reduced test case from erl_alloc.i, and it turns out
>>> Erlang/OTP was hit by
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70424, which affects
>>> gcc-4.9 (all versions) and gcc-5.x (x < 4), on all strict-alignment
>>> targets.
>>>
>>> So the recommendation stands: upgrade your gcc.
>>>
>>> > > Thanks & Regards,
>>> > > Pooja
>>> > >
>>> > > On Tue, May 12, 2020 at 10:44 PM Mikael Pettersson <[hidden email]> wrote:
>>> > >>
>>> > >> On Tue, May 12, 2020 at 4:18 PM Pooja Desai <[hidden email]> wrote:
>>> > >> >
>>> > >> > Hi,
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > Thanks for response Mikael
>>> > >> >
>>> > >> > As per your suggestion I am trying to write similar code to conclude if there is some issue with Solaris SPARC compiler.
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > But I have some doubts,
>>> > >> >
>>> > >> > 1.     If there is problem with compiler then we should be able to see this crash everywhere else also, any idea why its only reproduced here?
>>> > >> >
>>> > >> > 2.     As I understand your explanation it reads 64 bits by assembling two adjacent 32 bits fields. Will it really cause problem in multi-threaded program? Considering while context switching to another thread, OS will save current context of the thread (and hence registers) and will bring back when thread is active again.
>>> > >> >
>>> > >> >
>>> > >>
>>> > >> Breaking up a 64-bit load into two 32-bit loads loses atomicity with
>>> > >> any concurrent store into that location, meaning the read may end up
>>> > >> observing a result composed of 32 bit from the old value and 32 bit
>>> > >> from the newly stored value, whereas the code expects to see either
>>> > >> the old or the new, but never this mixture.  This can happen also on a
>>> > >> single-threaded CPU with preemptive multitasking.
>>> > >>
>>> > >> To move forward on the issue, I think you need to recreate the
>>> > >> pre-processed source for erl_alloc_util.c.  To do that:
>>> > >> 1. Compile Erlang/OTP as usual, starting from a pristine source
>>> > >> directory (no left-overs from a previous build, best is to start fresh
>>> > >> somewhere), but pass "V=1" to make.  Save the output from "make" in a
>>> > >> file.
>>> > >> 2. Note the step where it compiles erl_alloc_util.c.
>>> > >> 3. Reexecute that step, but replace any "-c" with "-E" and "-o
>>> > >> erl_alloc_util.o" with "-o erl_alloc_util.i".
>>> > >> 4. Please send this ".i" file, together with the exact build steps and
>>> > >> configuration options you used, and
>>> > >> "erts/sparc-sun-solaris11/config.h" (I'm guessing the file name here)
>>> > >> to me.
>>> > >>
>>> > >> My theory is that Erlang/OTP selects the wrong low-level primitives
>>> > >> for this platform.
>>> > >>
>>> > >>
>>> > >> >
>>> > >> >
>>> > >> > Thanks & Regards,
>>> > >> >
>>> > >> > Pooja
>>> > >> >
>>> > >> >
>>> > >> > On Mon, May 11, 2020 at 10:36 PM Mikael Pettersson <[hidden email]> wrote:
>>> > >> >>
>>> > >> >> Hello Pooja,
>>> > >> >>
>>> > >> >> On Mon, May 11, 2020 at 8:10 AM Pooja Desai <[hidden email]> wrote:
>>> > >> >> >
>>> > >> >> > Hi,
>>> > >> >> >
>>> > >> >> > Facing erlang core issue on solaris SPARC setup while running RabbitMQ
>>> > >> >>
>>> > >> >> This looks like a 64-bit build, but the code doesn't look similar to
>>> > >> >> what I get with gcc-9.3, so I'm assuming you used Sun's compiler?
>>> > >> >>
>>> > >> >>
>>> > >> >> > (dbx) where
>>> > >> >> >
>>> > >> >> > =>[1] cpool_insert(0x1004efd40, 0xffffffff75600000, 0x61850, 0xffffffff75600018, 0x90f, 0x1004effd0), at 0x10006db14
>>> > >> >> >
>>> > >> >> >   [2] abandon_carrier(0x1004efd40, 0xffffffff75600000, 0xffffffff75645ec0, 0xffffffff77d03818, 0x0, 0x6), at 0x10006de3c
>>> > >> >> >
>>> > >> >> >   [3] 17(0x1004efd40, 0xcb3, 0x2, 0xffffffff75645e60, 0x0, 0x1004efd40), at 0x10006e958
>>> > >> >> >
>>> > >> >> >   [4] erts_alcu_check_delayed_dealloc(0x1004efd40, 0x1, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x8000000000000007), at 0x100075244
>>> > >> >> >
>>> > >> >> >   [5] erts_alloc_scheduler_handle_delayed_dealloc(0xffffffff3a82a620, 0xffffffff77d03a40, 0xffffffff77d03a48, 0xffffffff77d03a44, 0x100464, 0xffffffff3a82a5d0),
>>> > >> >> >
>>> > >> >> > at 0x1000622c0
>>> > >> >> >
>>> > >> >> >   [6] handle_aux_work(0xffffffff3a8204a0, 0x2, 0x1, 0x2, 0x100400, 0x4e5ce123), at 0x1002a6044
>>> > >> >> >
>>> > >> >> >   [7] erts_schedule(0xffffffff3a820380, 0x9, 0x9, 0xffffffff3a81fc80, 0x2, 0x2), at 0x1002a3040
>>> > >> >> >
>>> > >> >> >   [8] process_main(0x100469, 0xffffffff3a302240, 0xfa0, 0x802a, 0xffffffff38f00438, 0x3), at 0x1002901bc
>>> > >> >> >
>>> > >> >> >   [9] sched_thread_func(0xffffffff3a820380, 0x0, 0x0, 0xffffffff7a911240, 0x100000, 0x1), at 0x100038f08
>>> > >> >> >
>>> > >> >> >   [10] thr_wrapper(0xffffffff7fffc278, 0x0, 0x0, 0x100289d48, 0xffffffff3a820380, 0x100038da0), at 0x100289dc8
>>> > >> >> >
>>> > >> >> >
>>> > >> >> >
>>> > >> >> > This issue is extremely intermittent so I am not able to reproduce it with debug build. But on our test setup I have seen this core twice only for solaris Sparc server for other servers (RHEL, Suse linux, Solarisx86, Windows etc.) with similar test environment things are working fine.
>>> > >> >> >
>>> > >> >> > In two instances when I faced this issue we are restarting Rabbitmq server. i.e. stop RabbitMQ and epmd then run startup script for rabbitmq. This performs 2 operations,
>>> > >> >> >
>>> > >> >> > First ping rabbitmq using "rabbitmqctl ping" to confirm rabbitmq is not already running ( I guess in background this will also start epmd) and then start rabbitmq-server in detached mode.
>>> > >> >> >
>>> > >> >> > Core is generated while starting this demon.
>>> > >> >> >
>>> > >> >> >
>>> > >> >> > I checked code around abandon_carrier("https://github.com/erlang/otp/blame/master/erts/emulator/beam/erl_alloc_util.c") but nothing changed in that area recently. So I am really clueless situation.
>>> > >> >> >
>>> > >> >> > Please let me know if anyone has faced a similar issue in the past or has any idea about this. We are using OTP version 22.2 and RabbitMQ version 3.7.23.
>>> > >> >> >
>>> > >> >> > Let me know if any further information is required; I am pasting the full core dump information below:
>>> > >> >> >
>>> > >> >> > debugging core file of beam.smp (64-bit) from hostname01
>>> > >> >> > file: temp_dir/erlang/erts-10.6/bin/beam.smp
>>> > >> >> > initial argv:
>>> > >> >> > /temp_dir/erlang/erts-10.6/bin/beam.smp -- -root /temp_dir/
>>> > >> >> > threading model: native threads
>>> > >> >> > status: process terminated by SIGSEGV (Segmentation Fault), addr=
>>> > >> >> > ffffffff004631b0
>>> > >> >>
>>> > >> >> Ok, this tells us the address was unmapped.  (It's not an alignment
>>> > >> >> fault, another common issue on SPARC.)
>>> > >> >>
>>> > >> >>
>>> > >> >> >
>>> > >> >> > C++ symbol demangling enabled
>>> > >> >> >
>>> > >> >> > # stack
>>> > >> >> >
>>> > >> >> > cpool_insert+0xd0(10051c500, ffffffff7a400000, ffffffff7a441de8, ffffffff7c903818, 0, 23)
>>> > >> >> > dealloc_block.part.17+0x1c0(10051c500, cb3, 2, ffffffff7a441d88, 0, 10051c500)
>>> > >> >> > erts_alcu_check_delayed_dealloc+0xe4(10051c500, 1, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 8000000000000007)
>>> > >> >> > erts_alloc_scheduler_handle_delayed_dealloc+0x34(ffffffff3b729c20, ffffffff7c903a40, ffffffff7c903a48, ffffffff7c903a44, 100464, ffffffff3b729bd0)
>>> > >> >> > handle_aux_work+0xa50(ffffffff3b71faa0, 402, 1, 402, 100400, 42da0c68)
>>> > >> >> > erts_schedule+0x192c(ffffffff3b71f980, 9, 9, ffffffff3b71f280, 402, 2)
>>> > >> >> > process_main+0xc4(100469, ffffffff3b202240, fa0, ffffffff3b71f980, 241, 100294204)
>>> > >> >> > sched_thread_func+0x168(ffffffff3b71f980, 0, 0, ffffffff39401a40, 100000, 1)
>>> > >> >> > thr_wrapper+0x80(ffffffff7fffb318, 0, 0, 100289d48, ffffffff3b71f980, 100038da0)
>>> > >> >> > libc.so.1`_lwp_start(0, 0, 0, 0, 0, 0)
>>> > >> >> >
>>> > >> >> > #############################################################################
>>> > >> >> >
>>> > >> >> > # registers
>>> > >> >> >
>>> > >> >> > %g0 = 0x0000000000000000                 %l0 = 0xffffffff7a4307a0
>>> > >> >> > %g1 = 0xffffffff004631a1                 %l1 = 0x0000000000000000
>>> > >> >> > %g2 = 0x0000000000000000                 %l2 = 0x0000000000000000
>>> > >> >> > %g3 = 0x000000010051c798                 %l3 = 0x0000000000000000
>>> > >> >> > %g4 = 0xffffffff004631a0                 %l4 = 0x0000000000000000
>>> > >> >> > %g5 = 0x00000001004631a0 beam.smp`firstfit_carrier_pool %l5 = 0x0000000000000000
>>> > >> >>
>>> > >> >> This is interesting.  Notice how the low 32-bits 004631a0 show up in
>>> > >> >> three variations:
>>> > >> >> 1. 00000001004631a0 beam.smp`firstfit_carrier_pool (the address of the
>>> > >> >> firstfit_carrier_pool global variable)
>>> > >> >> 2. ffffffff004631a0 (the above, but with the high 32 bits replaced
>>> > >> >> with all-bits-one)
>>> > >> >> 3. ffffffff004631a1 (the above, but with a tag in the low bit)
>>> > >> >>
>>> > >> >> > %g6 = 0x0000000000000000                 %l6 = 0x0000000000000000
>>> > >> >> > %g7 = 0xffffffff39401a40                 %l7 = 0x0000000000000000
>>> > >> >> > %o0 = 0x000000010051c500                 %i0 = 0x000000010051c500
>>> > >> >> > %o1 = 0xffffffff7a400000                 %i1 = 0xffffffff7a400000
>>> > >> >> > %o2 = 0x00000000000676c0                 %i2 = 0xffffffff7a441de8
>>> > >> >> > %o3 = 0xffffffff7a400018                 %i3 = 0xffffffff7c903818
>>> > >> >> > %o4 = 0x00000000000007b9                 %i4 = 0x0000000000000000
>>> > >> >> > %o5 = 0x000000010051c790                 %i5 = 0x0000000000000023
>>> > >> >> > %o6 = 0xffffffff7c902eb1                 %i6 = 0xffffffff7c902f61
>>> > >> >> > %o7 = 0x000000010006de3c abandon_carrier+0x118 %i7 = 0x000000010006e958 dealloc_block.part.17+0x1c0
>>> > >> >> >
>>> > >> >> >  %ccr = 0x44 xcc=nZvc icc=nZvc
>>> > >> >> >    %y = 0x0000000000000000
>>> > >> >> >   %pc = 0x000000010006db14 cpool_insert+0xd0
>>> > >> >> >  %npc = 0x000000010006db18 cpool_insert+0xd4
>>> > >> >> >   %sp = 0xffffffff7c902eb1
>>> > >> >> >   %fp = 0xffffffff7c902f61
>>> > >> >> >
>>> > >> >> >  %asi = 0x82
>>> > >> >> > %fprs = 0x00
>>> > >> >> >
>>> > >> >> > # disassembly around pc
>>> > >> >> >
>>> > >> >> > cpool_insert+0xa8:              mov       %g1, %g2
>>> > >> >> > cpool_insert+0xac:              ldx       [%g5 + 0x10], %g1
>>> > >> >> > cpool_insert+0xb0:              membar    #LoadLoad|#LoadStore
>>> > >> >> > cpool_insert+0xb4:              ba,pt     %xcc, +0x1c   <cpool_insert+0xd0>
>>> > >> >> > cpool_insert+0xb8:              and       %g1, -0x4, %g4
>>> > >> >>
>>> > >> >> > cpool_insert+0xbc:              membar    #LoadLoad|#LoadStore
>>> > >> >> > cpool_insert+0xc0:              and       %g2, 0x3, %g3
>>> > >> >> > cpool_insert+0xc4:              brz,pn    %g3, +0x1ec   <cpool_insert+0x2b0>
>>> > >> >> > cpool_insert+0xc8:              mov       %g2, %g1
>>> > >> >> > cpool_insert+0xcc:              and       %g1, -0x4, %g4
>>> > >> >> > cpool_insert+0xd0:              ld        [%g4 + 0x10], %g1
>>> > >> >>
>>> > >> >> This is the faulting instruction. We're in the /* Find a predecessor
>>> > >> >> to be, and set mod marker on its next ptr */ loop.
>>> > >> >>
>>> > >> >> > cpool_insert+0xd4:              ld        [%g4 + 0x14], %g2
>>> > >> >> > cpool_insert+0xd8:              sllx      %g1, 0x20, %g1
>>> > >> >> > cpool_insert+0xdc:              cmp       %g5, %g4
>>> > >> >> > cpool_insert+0xe0:              bne,pt    %xcc, -0x24   <cpool_insert+0xbc>
>>> > >> >> > cpool_insert+0xe4:              or        %g2, %g1, %g2
>>> > >> >>
>>> > >> >> The above reads a 64-bit "->next" pointer by assembling two adjacent
>>> > >> >> 32-bit fields.  Weird, but arithmetically Ok.
>>> > >> >>
>>> > >> >> Two things strike me:
>>> > >> >> 1. The compiler implements "atomic load of 64-bits" as "load 32 bits,
>>> > >> >> load another 32 bits, combine", which isn't correct in a multithreaded
>>> > >> >> program.  The error could be in the compiler, or in the source code.
>>> > >> >> 2. In the register dump it was obvious that the high bits of an
>>> > >> >> address had been clobbered.
>>> > >> >>
>>> > >> >> My suspicion is that either Sun's compiler is buggy, or Erlang is
>>> > >> >> selecting a non-thread-safe code path in this case.
>>> > >> >>
>>> > >> >> On SPARC64 Linux w/ GCC I get very different code that uses "ldx" for
>>> > >> >> those 64-bit loads, as expected.
>>> > >> >>
>>> > >> >> /Mikael