Hi

I have a rather large user of YXA that is experiencing problems with beam consuming 100% CPU until they restart it, about once a month.

What are people doing to find this kind of bug? I seem to remember someone writing that they dump the state of all processes periodically in their production systems - does anyone have code to that effect to share?

I suspect the bug they are experiencing is something similar to

http://old.nabble.com/100--CPU-usage-on-Mac-OS-X-Leopard-after-peer-closes-socket-td16731178.html

although it might of course be something in YXA... It is not that particular problem, although there are similarities; they are running R12 on BSD (FreeBSD).

I don't think my user is really capable of attaching to the running nodes and performing very much fault isolation when this happens, partly because of lack of Erlang wizard status, and also because of the urgency to get the node back up and running.

Any ideas (and especially code ;) ) will be greatly appreciated.

/Fredrik
________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org

Fredrik,

In my experience, the behavior you're describing could be related to a process message queue: say you have a process that uses selective receive to match one pattern while ignoring the others. Such a process may accumulate a substantial number of messages over time and cause increased CPU utilization (as each selective receive has to scan an ever longer queue).

To narrow down the scope of the problem, I suggest that you log in to the system and issue regs(), which may indicate which process has more messages than it ought to.

V/

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Fredrik Thulin
Sent: 22 January 2010 11:32 AM
To: [hidden email]
Subject: [erlang-questions] finding hard to find bugs in production systems
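
The message-queue check suggested above can also be done programmatically. regs() only lists registered processes, so the following is a minimal sketch that looks at every process on the node instead. The module name queue_check and the function top_queues/1 are made up for illustration; only erlang:processes/0 and erlang:process_info/2 come from the runtime.

```erlang
-module(queue_check).
-export([top_queues/1]).

%% Return the N processes with the longest message queues as
%% {QueueLen, Pid, RegisteredName} tuples, longest queue first.
%% RegisteredName is 'undefined' for unregistered processes.
top_queues(N) when is_integer(N), N >= 0 ->
    Infos = [{Len, Pid, registered_name(Pid)}
             || Pid <- erlang:processes(),
                %% process_info/2 returns 'undefined' if the process has
                %% died in the meantime; the pattern filters those out.
                {message_queue_len, Len}
                    <- [erlang:process_info(Pid, message_queue_len)]],
    lists:sublist(lists:reverse(lists:sort(Infos)), N).

%% Hypothetical helper: look up the registered name, if any.
registered_name(Pid) ->
    case erlang:process_info(Pid, registered_name) of
        {registered_name, Name} -> Name;
        _ -> undefined  % [] when unregistered, undefined when dead
    end.
```

Calling queue_check:top_queues(10) from a shell attached to the node returns the ten longest queues. A process whose queue keeps growing between calls is a reasonable first suspect.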

In reply to this post by Fredrik Thulin

Fredrik, have you tried using etop? It's easy enough to find the CPU pig, though identifying exactly what that pig is doing isn't always so easy.

    /path/to/lib/observer-*/priv/bin/etop -node name@box-to-watch \
        [-lines N] [-interval 1] [-tracing off] [...]

The "-tracing off" is helpful if the target system is extremely overloaded, since tracing all processes is very intrusive. If the target system is extremely overloaded anyway, it may interfere with the inter-node communication between the target node and the etop node, which can make the reports incomplete or impossible. :-(

Give it a try. Oh, and make certain that the correct cookie is available to the etop node.

-Scott
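
If running etop from a second node is not practical, a rough approximation of "who is the CPU pig" can be run on the node itself. The sketch below is not how etop is implemented. It just samples each process's reduction count twice and ranks processes by the difference. The module name cpu_hogs and the function top_reductions/2 are hypothetical.

```erlang
-module(cpu_hogs).
-export([top_reductions/2]).

%% Sample every process's reduction count, wait IntervalMs, sample again,
%% and return the N processes that did the most work in between as
%% {DeltaReductions, Pid, CurrentFunction} tuples, busiest first.
%% Processes started or stopped during the interval are skipped.
top_reductions(N, IntervalMs) ->
    Before = snapshot(),
    timer:sleep(IntervalMs),
    After = snapshot(),
    %% keysearch makes this quadratic in the number of processes,
    %% which is acceptable for a one-off diagnostic call.
    Deltas = [{R2 - R1, Pid, current_function(Pid)}
              || {Pid, R2} <- After,
                 {value, {_, R1}} <- [lists:keysearch(Pid, 1, Before)]],
    lists:sublist(lists:reverse(lists:sort(Deltas)), N).

%% List of {Pid, Reductions} for all processes alive right now.
snapshot() ->
    [{Pid, R} || Pid <- erlang:processes(),
                 {reductions, R} <- [erlang:process_info(Pid, reductions)]].

%% Where the process is executing at the moment, as {Module, Function, Arity}.
current_function(Pid) ->
    case erlang:process_info(Pid, current_function) of
        {current_function, MFA} -> MFA;
        undefined -> undefined  % process exited after the second sample
    end.
```

For example, cpu_hogs:top_reductions(5, 1000) lists the five busiest processes over one second. If beam is pegged at 100% CPU and no process stands out here, that would presumably suggest the time is being spent outside Erlang code, though that is an inference and not something this sketch can confirm.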

On Sun, 2010-01-24 at 23:51 -0600, Scott Lystig Fritchie wrote:
> Fredrik, have you tried using etop? It's easy enough to find the CPU
> pig, though identifying exactly what that pig is doing isn't always so
> easy.

No, it has been several years since I last looked at etop. Thanks for reminding me about it, I should have thought of that myself.

How would a bug in some low-level function like the one discussed in

http://old.nabble.com/100--CPU-usage-on-Mac-OS-X-Leopard-after-peer-closes-socket-td16731178.html

look in etop? Does anyone know?

Etop isn't ideal though, since it requires some knowledge to use, and also might not be practically usable in a situation where everyone wants the production system back to its normal state as soon as possible...

/Fredrik
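
One way around the need for someone to attach to the node at the moment of failure is the periodic state dump mentioned at the start of the thread. The following is only a sketch of that idea, not code from any of the posters: the module name proc_dump, the function names, and the choice of items are all assumptions, and it relies on erlang:process_info/2 accepting a list of items on the release in use. Log rotation is left out, so the file grows without bound.

```erlang
-module(proc_dump).
-export([start/2, dump/1]).

%% Spawn a process that appends a snapshot of all processes to File
%% every IntervalMs milliseconds. Returns the pid of the dumper.
start(File, IntervalMs) ->
    spawn(fun() -> loop(File, IntervalMs) end).

loop(File, IntervalMs) ->
    dump(File),
    timer:sleep(IntervalMs),
    loop(File, IntervalMs).

%% Append one timestamped snapshot to File as a printed Erlang term.
%% Returns ok on success.
dump(File) ->
    Items = [registered_name, current_function, initial_call,
             message_queue_len, reductions, memory, status],
    %% Processes that die while we iterate yield 'undefined' and are skipped.
    Procs = [{Pid, Info} || Pid <- erlang:processes(),
                            Info <- [erlang:process_info(Pid, Items)],
                            Info =/= undefined],
    {ok, Fd} = file:open(File, [append]),
    io:format(Fd, "~p~n", [{erlang:localtime(), Procs}]),
    file:close(Fd).
```

Started with something like proc_dump:start("/var/log/proc_dump.log", 60000) (the path is just an example), the last few snapshots before a restart show which processes had growing queues or reduction counts, without anyone having to touch the node while it is misbehaving.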

2010/1/25 Fredrik Thulin <[hidden email]>
> Etop isn't ideal though since it requires some knowledge to use, and
> also might not be practically usable in a situation where everyone wants
> the production system back to normal state as soon as possible...

That should force an erl_crash.dump to be written, which might give you clues about the 'pig'.

Chandru
