Investigate an infinite loop on production servers

Morgan Segalis
Hello everyone,

I'm having a bit of an issue with my production servers.

At some point it seems to enter an infinite loop that I can't find or reproduce on the test servers.

The bug appears completely at random, 1 hour or 10 hours after restarting the Erlang node.
The loop eats up all my server's memory in no time and completely freezes the Erlang node without crashing it (most of the time).

Once I got a crash dump and tried to investigate it with cdv, but I didn't get much information about which process or module was eating up all the memory.
I only know that it crashed because of the message: "eheap_alloc: Cannot allocate 6801972448 bytes of memory (of type "heap")."

I'm surely too new to Erlang to investigate something like this with cdv; I would really appreciate some pointers on how to understand this problem and fix it ASAP.

If you need any information from the crash dump, let me know what you need and I'll copy/paste it.

I'm using Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:true]

Thank you all for your help!



Bob Ippolito-2
This kind of thing tends to happen when you continuously send messages to a
process faster than it can handle them. The most common case I've seen is
where you have a lot of processes communicating with a single gen_server
process. If your server has swap enabled, this may make the node appear to
"freeze completely but not crash".

In the past I've diagnosed this by monitoring the message_queue_len of
registered processes, but I'm sure there are tools that can help do this
for you.
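For what it's worth, a minimal sketch of that kind of check (the module name and threshold are illustrative, not something from the original post):

```erlang
%% Sketch: list registered processes whose mailbox exceeds a threshold.
%% Run inside the node, or against it via rpc:call/4 from a hidden node.
-module(queue_watch).
-export([report/1]).

report(Threshold) ->
    [{Name, Len} || Name <- registered(),
                    Len <- [queue_len(whereis(Name))],
                    Len > Threshold].

%% Returns 0 for processes that disappear between calls, so the
%% comprehension above never crashes on a race.
queue_len(undefined) -> 0;
queue_len(Pid) ->
    case process_info(Pid, message_queue_len) of
        {message_queue_len, N} -> N;
        undefined -> 0
    end.
```

Calling something like queue_watch:report(1000) every few seconds and logging any entries it returns is usually enough to catch the offending process.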


On Wed, May 22, 2013 at 7:00 PM, Morgan Segalis <msegalis> wrote:

> [...]


Morgan Segalis
Hi,

Generally, when a module is critical and heavily used, I create a pg2 "pool" of supervised gen_servers that all join the group; callers then use get_closest_pid so the work is spread over multiple processes.
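For reference, that pattern looks roughly like this (module, group name and callbacks are illustrative, not the poster's actual code): each worker joins a pg2 group in init/1, and callers pick a member with get_closest_pid, which prefers a pid on the local node.

```erlang
%% Sketch of a pg2-based pool worker (R16-era pg2 API; names illustrative).
-module(pool_worker).
-behaviour(gen_server).
-export([start_link/0, call/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(GROUP, my_pool).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    ok = pg2:create(?GROUP),        %% idempotent: ok if it already exists
    ok = pg2:join(?GROUP, self()),  %% add this worker to the pool
    {ok, []}.

%% Callers are routed to a pool member, preferring one on the local node.
call(Request) ->
    gen_server:call(pg2:get_closest_pid(?GROUP), Request).

handle_call(Request, _From, State) -> {reply, {ok, Request}, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```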

Furthermore, the server has 16 GB of RAM, and when it starts to go crazy it is using at most 1.5 GB, so I guess it would need to go crazy for a long time before touching swap. I don't notice anything until another node in the cluster tells me that the frozen node has timed out.

However, while we're at it, I may have found something really weird in my crash dump.
I'm using the emysql application.
My initialization of the emysql application is pretty basic:

application:start(emysql),
emysql:add_pool(my_db,
            30,
            "login",
            "password",
            "my.db-host.com",
            3306,
            "table",
            latin1)
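Assuming this is the eight-argument emysql:add_pool (pool id, pool size, user, password, host, port, database, encoding), a query against the pool would then look something like the following; the query itself is a made-up example:

```erlang
%% Hypothetical query against the my_db pool configured above.
%% emysql:execute/2 returns a result_packet record on success and an
%% error_packet record on failure (per the emysql documentation).
Result = emysql:execute(my_db, <<"SELECT id FROM users LIMIT 10">>).
```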

As you can see, I only asked for 30 connections in the pool. However, in the crash dump, here's what I found in the fun table:

Module       Uniq       Index  Address             Native_address  Refc

emysql_util  8432855    1      0x00007f1d4f9f6f00                  3476
emysql_util  8432855    0      0x00007f1d4f9f7218                  3476
emysql_util  8432855    3      0x00007f1d4f9f6e48                  2
emysql_util  8432855    2      0x00007f1d4f9f6ea8                  1
emysql       79898780   0      0x00007f1d4f9b56f8                  841

Is that normal to see with only 30 connections in one pool?

Thank you all.


On May 23, 2013, at 04:21, Bob Ippolito <bob> wrote:

> [...]



Morgan Segalis
Hi,

Unfortunately, I'm not using RabbitMQ.

As for garbage collection: since the server has 16 GB of RAM and the Erlang node uses only 1.5 GB when everything works fine, garbage collection shouldn't be an issue here, should it?

On May 23, 2013, at 05:14, Yogish Baliga <baliga> wrote:

> I saw this message in our RabbitMQ server. Investigation lead me to the garbage collection. It happened only once. After restart everything seems to be fine.
>
> [...]


Dmitry Kolesnikov
Hello,

I would agree with Bob about the most probable root cause. You can use entop to check the message queue length and memory used per process.

Best Regards,
Dmitry >-|-|-*>


On 23.5.2013, at 5.21, Bob Ippolito <bob> wrote:

> [...]


Vance Shipley-2
On Thu, May 23, 2013 at 04:00:07AM +0200, Morgan Segalis wrote:
}  I'm having a bit of an issue with my production servers.

You will find that etop is your friend:

        http://www.erlang.org/doc/apps/observer/etop_ug.html

Run etop from the command line and sort on the column you're
interested in.  To watch memory usage:

        etop -node tiger -sort memory

This will list the processes by memory size in decreasing order.
This shows you the memory hogs.  Watch it as it starts to get
into trouble and you should see where the memory is getting used.

As Bob points out, the most common problem is that a process's
inbox starts to fill up.  Once this starts happening it's
the beginning of the end.  Another process may start eating up
memory, and the node may crash because it has requested more than
is available, but the root cause was that one process not having
time to service the messages at the rate they are received.

To watch for message queue lengths:

        etop -node tiger -sort msg_q

The above will list the processes in decreasing order of inbox
size.  Normally they should all be zero, occasionally one.  If
you have a problem you'll see one process stay at the top and its
message queue length will grow over time.
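etop can also be driven from an Erlang shell with an option list, which is convenient for attaching to a remote production node; the node name and cookie below are placeholders:

```erlang
%% Start a hidden node that shares the production cookie, e.g.:
%%   erl -name etop1@myhost -hidden -setcookie <cookie>
%% then start etop against the remote node (options per the etop
%% user's guide):
etop:start([{node, 'tiger@myhost'},  %% placeholder production node
            {sort, msg_q},           %% or: memory | reductions | runtime
            {interval, 5},           %% refresh every 5 seconds
            {lines, 20}]).           %% show the top 20 processes
```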

--
        -Vance


Morgan Segalis
I have launched etop on my computer to monitor the production server, hoping that I will see something wrong!

Thank you for your help so far (to All).

I'll come back to you as soon as I have more information with etop.

Morgan.

On May 23, 2013, at 07:38, Vance Shipley <vances> wrote:

> [...]



Morgan Segalis
Apparently I'm monitoring my own node…

Does anyone know how to monitor an external cluster node with etop?

On May 23, 2013, at 11:13, Morgan Segalis <msegalis> wrote:

> [...]



Morgan Segalis
Never mind, I got it…

However, I do not get a lot of information…

Most of the processes are in proc_lib:init_p/5.

On May 23, 2013, at 11:23, Morgan Segalis <msegalis> wrote:

> [...]



Dmitry Kolesnikov
which means that you are using proc_lib heavily (i.e. processes started through OTP behaviours)...
Are those top processes sorted by reductions, message queue size, or heap?

Try connecting to the node and gathering more info about those processes using
erlang:process_info/2 or sys:get_status/2.
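Concretely, once you have a suspect pid, something along these lines in the node's shell pulls the useful details (the pid here is a placeholder):

```erlang
%% Inspect a suspect process; pid(0, 42, 0) is a placeholder.
Pid = pid(0, 42, 0),
erlang:process_info(Pid, [registered_name, current_function, initial_call,
                          message_queue_len, memory, reductions]),
%% For OTP processes (gen_server, gen_fsm, ...) this also shows the
%% callback module and its internal state; 5000 is the timeout in ms.
sys:get_status(Pid, 5000).
```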

- Dmitry

On May 23, 2013, at 12:35 PM, Morgan Segalis <msegalis> wrote:

> [...]



Morgan Segalis
For more information, here's what my Erlang node is doing:

It is an instant messaging server; each connected client is a process spawned automatically by a supervisor.

Every spawned process is monitored and started by a supervisor.

I made a little function a while back that collects all processes and removes the ones started at init…

Here's what it gives me when everything works fine:

Dict: {dict,16,16,16,8,80,48,
            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
            {{[[{{connector_serv,init,1},[connector_suprc42,connector,<0.42.0>]}|548]],
              [],
              [[{{supervisor,connector_sup,1},[connector,<0.42.0>]}|3],
               [{{connector_serv,init,1},[connector_supssl,connector,<0.42.0>]}|1460],
               [{{supervisor,casserl_sup,1},[connector,<0.42.0>]}|1],
               [{{supervisor,pushiphone_sup,1},[connector,<0.42.0>]}|2],
               [{{pushiphone,init,1},['pushiphone-lite',connector,<0.42.0>]}|3],
               [{{supervisor,clientpool_sup,1},[connector,<0.42.0>]}|1]],
              [],
              [[{{clientpool,init,1},[clientpool_sup,connector,<0.42.0>]}|1],
               [undefined|4]],
              [],
              [[{{supervisor,connector,1},[<0.42.0>]}|1],
               [{{casserl_serv,init,1},[casserl_sup,connector,<0.42.0>]}|50]],
              [],[],[],
              [[{{connector_serv,init,1},[connector_suprc4,connector,<0.42.0>]}|472],
               [{{ssl_connection,init,1},
                 [ssl_connection_sup,ssl_sup,<0.51.0>]}|
                1366],
               [{unknown,unknown}|3]],
              [],[],
              [[{{pushiphone,init,1},['pushiphone-full',connector,<0.42.0>]}|3]],
              [],
              [[{{pg2,init,1},[kernel_safe_sup,kernel_sup,<0.10.0>]}|1]]}}}
ok



On May 23, 2013, at 11:50, Dmitry Kolesnikov <dmkolesnikov> wrote:

> [...]



Morgan Segalis
When a client is connecting, yes, I'm using start_child.

The supervision strategy is: simple_one_for_one, temporary, worker.
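A supervisor of that shape (simple_one_for_one with temporary workers) typically looks like the sketch below; all module names are illustrative, not the poster's code:

```erlang
%% Sketch of a simple_one_for_one supervisor for per-client connection
%% processes. temporary means crashed workers are never restarted,
%% which suits short-lived client connections. Names are illustrative.
-module(connection_sup).
-behaviour(supervisor).
-export([start_link/0, start_connection/1]).
-export([init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Called for each accepted client; Socket is passed to the new worker.
start_connection(Socket) ->
    supervisor:start_child(?MODULE, [Socket]).

init([]) ->
    {ok, {{simple_one_for_one, 10, 10},
          [{connection_serv,
            {connection_serv, start_link, []},   %% worker entry point
            temporary, 5000, worker, [connection_serv]}]}}.
```

With simple_one_for_one, the extra arguments given to start_child/2 are appended to the child spec's argument list, so each worker gets its own socket.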

Morgan.


On May 23, 2013, at 12:13, Dmitry Kolesnikov <dmkolesnikov> wrote:

>
> On May 23, 2013, at 1:04 PM, Morgan Segalis <msegalis> wrote:
>
>> Every process spawned is monitored and started by a supervisor?
>
> Do you use start_child to spawn a new process? If so, do you clean it up?
> What is the supervising strategy?
>
> - Dmitry



Dmitry Kolesnikov

On May 23, 2013, at 1:04 PM, Morgan Segalis <msegalis> wrote:

> I made a little function a while back that collects all processes and removes the ones started at init…

Could you please elaborate on that? Why are you not satisfied with the supervisor?

- Dmitry


Morgan Segalis
No, I was talking about the function I made to investigate which processes I have created; it gives me this output:

Dict: {dict,16,16,16,8,80,48,
           {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
           {{[[{{connector_serv,init,1},[connector_suprc42,connector,<0.42.0>]}|548]],
             [],
             [[{{supervisor,connector_sup,1},[connector,<0.42.0>]}|3],
              [{{connector_serv,init,1},[connector_supssl,connector,<0.42.0>]}|1460],
              [{{supervisor,casserl_sup,1},[connector,<0.42.0>]}|1],
              [{{supervisor,pushiphone_sup,1},[connector,<0.42.0>]}|2],
              [{{pushiphone,init,1},['pushiphone-lite',connector,<0.42.0>]}|3],
              [{{supervisor,clientpool_sup,1},[connector,<0.42.0>]}|1]],
             [],
             [[{{clientpool,init,1},[clientpool_sup,connector,<0.42.0>]}|1],
              [undefined|4]],
             [],
             [[{{supervisor,connector,1},[<0.42.0>]}|1],
              [{{casserl_serv,init,1},[casserl_sup,connector,<0.42.0>]}|50]],
             [],[],[],
             [[{{connector_serv,init,1},[connector_suprc4,connector,<0.42.0>]}|472],
              [{{ssl_connection,init,1},
                [ssl_connection_sup,ssl_sup,<0.51.0>]}|
               1366],
              [{unknown,unknown}|3]],
             [],[],
             [[{{pushiphone,init,1},['pushiphone-full',connector,<0.42.0>]}|3]],
             [],
             [[{{pg2,init,1},[kernel_safe_sup,kernel_sup,<0.10.0>]}|1]]}}}
ok

I'm very satisfied with the supervisor, and I don't think I have the expertise to tweak it...

On May 23, 2013, at 14:19, Dmitry Kolesnikov <dmkolesnikov> wrote:

> [...]



Dmitry Kolesnikov
Right, you do not have many processes. Yet at the same time you run out of memory?

Unfortunately, I have not had time to play around with R16B in production…
Could it be some issue with SSL? I recall there were some complaints about it on the list.

I would use entop to spot the process that has either too many reductions, too long a message queue, or too large a heap.
Once you know its pid you can dig up more info about it using erlang:process_info/2 and/or sys:get_status/2.

BTW, what does erlang:memory() say on your production node?

- Dmitry

On May 23, 2013, at 3:25 PM, Morgan Segalis <msegalis> wrote:

> No, I was talking about the function I made to investigate which processes I have created, which gives me this output :
>
> Dict: {dict,16,16,16,8,80,48,
>            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
>            {{[[{{connector_serv,init,1},[connector_suprc42,connector,<0.42.0>]}|548]],
>              [],
>              [[{{supervisor,connector_sup,1},[connector,<0.42.0>]}|3],
>               [{{connector_serv,init,1},[connector_supssl,connector,<0.42.0>]}|1460],
>               [{{supervisor,casserl_sup,1},[connector,<0.42.0>]}|1],
>               [{{supervisor,pushiphone_sup,1},[connector,<0.42.0>]}|2],
>               [{{pushiphone,init,1},['pushiphone-lite',connector,<0.42.0>]}|3],
>               [{{supervisor,clientpool_sup,1},[connector,<0.42.0>]}|1]],
>              [],
>              [[{{clientpool,init,1},[clientpool_sup,connector,<0.42.0>]}|1],
>               [undefined|4]],
>              [],
>              [[{{supervisor,connector,1},[<0.42.0>]}|1],
>               [{{casserl_serv,init,1},[casserl_sup,connector,<0.42.0>]}|50]],
>              [],[],[],
>              [[{{connector_serv,init,1},[connector_suprc4,connector,<0.42.0>]}|472],
>               [{{ssl_connection,init,1},
>                 [ssl_connection_sup,ssl_sup,<0.51.0>]}|
>                1366],
>               [{unknown,unknown}|3]],
>              [],[],
>              [[{{pushiphone,init,1},['pushiphone-full',connector,<0.42.0>]}|3]],
>              [],
>              [[{{pg2,init,1},[kernel_safe_sup,kernel_sup,<0.10.0>]}|1]]}}}
> ok
>
> I'm very satisfied with supervisor, and I don't think I have the expertise to tweak it...
>
> On 23 May 2013, at 14:19, Dmitry Kolesnikov <dmkolesnikov> wrote:
>
>>
>> On May 23, 2013, at 1:04 PM, Morgan Segalis <msegalis> wrote:
>>
>>> I made a little function a while back that gets all processes and removes the ones started at the beginning…
>>
>> Could you please elaborate on that? Why you are not satisfied with supervisor?
>>
>> - Dmitry
>
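Dmitry's entop/process_info suggestion can be sketched directly from the Erlang shell. A minimal, hedged example (the helper fun and the cutoff of 5 are illustrative, not from the thread; the item keys are standard `erlang:process_info/2` items):

```erlang
%% Shell sketch: list the N processes with the longest message queues,
%% then inspect the worst offender in detail.
TopByQueue = fun(N) ->
    Ps = [{P, QLen} || P <- erlang:processes(),
                       {message_queue_len, QLen} <-
                           [erlang:process_info(P, message_queue_len)]],
    lists:sublist(lists:reverse(lists:keysort(2, Ps)), N)
end,
[{Pid, _QLen} | _] = TopByQueue(5),
erlang:process_info(Pid, [current_function, message_queue_len,
                          total_heap_size, registered_name]).
```

Note that processes which die mid-scan make `process_info` return `undefined`; the pattern in the generator silently skips them.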



Investigate an infinite loop on production servers

Morgan Segalis
So I should go back to R15B?

erlang:memory() gives me

[{total,1525779584},
 {processes,1272881427},
 {processes_used,1272789743},
 {system,252898157},
 {atom,372217},
 {atom_used,346096},
 {binary,148093608},
 {code,8274446},
 {ets,1546832}]


But keep in mind that right now there is no infinite loop or memory issue at this exact time…
It would be more interesting to capture that when the VM is asking for 14 GB of memory, but when it does, the console is unresponsive, so I can't get anything then.

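Since the console becomes unresponsive once the blow-up starts, one workaround is to log `erlang:memory()` and the largest processes to disk on a timer, so there is evidence left after a freeze. A minimal sketch; the module name, the 10-second interval, and the path "/tmp/mem.log" are arbitrary assumptions:

```erlang
%% Periodically append erlang:memory() and the three largest processes
%% (by total heap size) to a log file, so data survives a VM freeze.
-module(mem_logger).
-export([start/0, loop/0]).

start() ->
    spawn(fun ?MODULE:loop/0).

loop() ->
    Top = lists:sublist(
            lists:reverse(
              lists:keysort(2,
                [{P, Heap} || P <- erlang:processes(),
                              {total_heap_size, Heap} <-
                                  [erlang:process_info(P, total_heap_size)]])),
            3),
    Line = io_lib:format("~p ~p top=~p~n",
                         [erlang:localtime(), erlang:memory(), Top]),
    ok = file:write_file("/tmp/mem.log", Line, [append]),
    timer:sleep(10000),
    ?MODULE:loop().
```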



Investigate an infinite loop on production servers

Vance Shipley-2
Keep etop running and capture the output to a file (e.g. etop ... | tee etop.log). After it gets into trouble, look back and see what was happening beforehand.
>


Investigate an infinite loop on production servers

Morgan Segalis
Yeah, that's what I'm doing right now, but of course when I'm monitoring it, it won't crash; only when I sleep!!

I'll get back to the Erlang list as soon as I have more information about this.

Thank you all!

Morgan.




Investigate an infinite loop on production servers

Morgan Segalis
Ok, it finally got into the infinite loop…

And of course, the node on which I was running etop could not give me anything more, since it got disconnected from the production node.

So back to square one… no way to investigate correctly so far :-/

Morgan.




Investigate an infinite loop on production servers

Dmitry Kolesnikov
Your system is definitely leaking some resources :-/
 - Check the number of used FDs; maybe you have exceeded a limit there
 - What was the overall system memory / CPU utilisation before the crash?
 - Check how many connections you had before the crash; maybe you can reproduce it in dev

- Dmitry

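The checks Dmitry lists can also be snapshotted from the Erlang side (FD usage itself still has to be read from the OS; this sketch only covers what the VM can report about its own ports, processes, and memory):

```erlang
%% Resource snapshot: VM-owned ports (sockets/files), process counts
%% against their limits, and total allocated memory.
Snapshot = [{port_count,    length(erlang:ports())},
            {port_limit,    erlang:system_info(port_limit)},
            {process_count, erlang:system_info(process_count)},
            {process_limit, erlang:system_info(process_limit)},
            {memory_total,  erlang:memory(total)}],
io:format("~p~n", [Snapshot]).
```

Logging this periodically alongside etop output makes it easier to see which resource was climbing before a freeze.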

