finding hard to find bugs in production systems

Fredrik Thulin
Hi

I have a rather large user of YXA who is experiencing a problem where
beam consumes 100% CPU until they restart it. This happens about once a
month.

What do people do to find this kind of bug? I seem to remember someone
writing that they periodically dump the state of all processes in their
production systems - does anyone have code to that effect to share?

I suspect the bug they are experiencing is something similar to

http://old.nabble.com/100--CPU-usage-on-Mac-OS-X-Leopard-after-peer-closes-socket-td16731178.html

although it might of course be something in YXA... It is not that
particular problem, although there are similarities; they are running
R12, on BSD (FreeBSD).

I don't think my user is really capable of attaching to the running
nodes and performing very much fault isolation when this happens, partly
because of a lack of Erlang wizard status, and also because of the
urgency to get the node back up and running.

Any ideas (and especially code ;) ) will be greatly appreciated.

/Fredrik



________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

RE: finding hard to find bugs in production systems

Valentin Micic-2
Fredrik,

In my experience, the behavior you're describing could be related to a
process message queue. Say you have a process that uses selective receive,
matching one pattern but ignoring another. Such a process may accumulate a
substantial number of messages over time and cause increased CPU
utilization, since every selective receive has to scan past the unmatched
messages and so gets slower as the queue grows.
To narrow down the problem scope, I suggest that you log in to the system
and issue regs() in the Erlang shell, which may indicate which process has
more messages than it ought to.
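
The same check can be sketched in code so that it covers every process, not
only the registered ones that regs() lists. This is a minimal sketch, not
anything taken from YXA: the module name queue_top and the function top/1 are
hypothetical, and only processes/0, process_info/2 and the lists functions are
standard.

```erlang
%% queue_top.erl: hypothetical helper module, not part of OTP or YXA.
-module(queue_top).
-export([top/1]).

%% Return up to N {QueueLen, Pid, Name} tuples, longest message queue first.
%% Name is the registered name, or [] for an unregistered process.
top(N) when is_integer(N), N >= 0 ->
    Rows = [{Len, Pid, name_of(Pid)}
            || Pid <- processes(),
               %% process_info/2 returns undefined for a process that has
               %% already exited; the pattern below silently skips those.
               {message_queue_len, Len} <- [process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Rows)), N).

name_of(Pid) ->
    case process_info(Pid, registered_name) of
        {registered_name, Name} -> Name;
        _ -> []   % unregistered, or the process died in the meantime
    end.
```

Calling queue_top:top(10) a few times, some seconds apart, should be enough:
a process whose queue length keeps growing between calls is a likely
candidate.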

V/

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On
Behalf Of Fredrik Thulin
Sent: 22 January 2010 11:32 AM
To: [hidden email]
Subject: [erlang-questions] finding hard to find bugs in production systems

Re: finding hard to find bugs in production systems

Scott Lystig Fritchie
In reply to this post by Fredrik Thulin
Fredrik, have you tried using etop?  It's easy enough to find the CPU
pig, though identifying exactly what that pig is doing isn't always so
easy.

    /path/to/lib/observer-*/priv/bin/etop -node name@box-to-watch \
        [-lines N] [-interval 1] [-tracing off] [...]

The "-tracing off" is helpful if the target system is extremely
overloaded, since tracing all processes is very intrusive.  If the
target system is extremely overloaded anyway, it may interfere with the
inter-node communication between the target node and the etop node,
which can make the reports incomplete or impossible.  :-(  Give it a
try.  Oh, and make certain that the correct cookie is available to the
etop node.

-Scott

Re: finding hard to find bugs in production systems

Fredrik Thulin
On Sun, 2010-01-24 at 23:51 -0600, Scott Lystig Fritchie wrote:
> Fredrik, have you tried using etop?  It's easy enough to find the CPU
> pig, though identifying exactly what that pig is doing isn't always so
> easy.

No, it has been several years since I looked at etop. Thanks for
reminding me about it, I should have thought of that myself.

How would a bug in some low level function like the one discussed in

http://old.nabble.com/100--CPU-usage-on-Mac-OS-X-Leopard-after-peer-closes-socket-td16731178.html

look in etop? Does anyone know?

Etop isn't ideal though since it requires some knowledge to use, and
also might not be practically usable in a situation where everyone wants
the production system back to normal state as soon as possible...

/Fredrik



Re: finding hard to find bugs in production systems

Chandru-4
2010/1/25 Fredrik Thulin <[hidden email]>

> On Sun, 2010-01-24 at 23:51 -0600, Scott Lystig Fritchie wrote:
> > Fredrik, have you tried using etop?  It's easy enough to find the CPU
> > pig, though identifying exactly what that pig is doing isn't always so
> > easy.
>
> No, it was several years since I looked at etop. Thanks for reminding me
> about it, I should have thought of that myself.
>
> How would a bug in some low level function like the one discussed in
>
>
> http://old.nabble.com/100--CPU-usage-on-Mac-OS-X-Leopard-after-peer-closes-socket-td16731178.html
>
> look in etop? Does anyone know?
>
> Etop isn't ideal though since it requires some knowledge to use, and
> also might not be practically usable in a situation where everyone wants
> the production system back to normal state as soon as possible...
>
>
Maybe you could ask your client to shut down the node using erlang:halt/1.
That should force an erl_crash.dump to be written, which might give you
clues about the 'pig'.
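
A rough sketch of what that could look like, assuming the client can reach a
shell on the affected node. The module name dump_node, the function name and
the slogan text are made up for illustration; only erlang:halt/1 itself is
standard. When it is given a string, the runtime writes a crash dump with
that string as the slogan and then exits.

```erlang
%% dump_node.erl: hypothetical helper module, not part of OTP or YXA.
-module(dump_node).
-export([halt_with_dump/0]).

%% Stop the node immediately and write a crash dump.
%% Given a string, erlang:halt/1 produces a crash dump with that string
%% as the slogan and then exits the runtime system.
%% WARNING: this terminates the node; nothing after the call will run.
halt_with_dump() ->
    erlang:halt("manual dump: beam at 100% CPU").
```

By default the dump is written as erl_crash.dump in the node's current
working directory; the ERL_CRASH_DUMP environment variable can point it
somewhere else.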

Chandru