Identifying causes for high CPU usage


Identifying causes for high CPU usage

Eli Iser
I'm running several ejabberd nodes in a cluster. Sadly, it uses a very old Erlang version - R13B04, on top of CentOS with 4 cores. I am using 5 async-threads with kernel poll enabled.

The old Erlang version might explain the problem I saw, but I believe the question is general enough.

At one point the CPU usage of some of the nodes started climbing (from about 20% to near 100%). This happened only once and hasn't recurred since a full power-off and power-on of the cluster.

Connecting to the nodes with a remote shell showed that almost everything was the same as on unaffected nodes:

* cprof and eprof - showed the same usage patterns.
* Listing the top processes by erlang:process_info reductions and message_queue_len showed similar patterns.
* erlang:statistics - looked the same for wall_clock and reductions.
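For reference, the per-process scan described above can be sketched like this (a minimal sketch from a remote shell; the cutoff of 5 results is arbitrary):

```erlang
%% Hedged sketch: list the top 5 processes by reductions and by
%% message_queue_len, using only calls available on old releases.
Top = fun(Key) ->
          %% A {_, V} pattern skips processes that died mid-scan,
          %% since process_info/2 returns undefined for them.
          L = [{V, P} || P <- processes(),
                         {_, V} <- [erlang:process_info(P, Key)]],
          lists:sublist(lists:reverse(lists:sort(L)), 5)
      end,
io:format("top reductions: ~p~n", [Top(reductions)]),
io:format("top queues:     ~p~n", [Top(message_queue_len)]).
```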

The only concrete differences between the affected and unaffected nodes were:

* run_queue - affected nodes had a run queue of several dozen (less than 100), while unaffected nodes always had 0. Since run_queue is undocumented (at least I didn't see it in the documentation), I didn't check it at the time of the problem.
* runtime - on affected nodes the runtime advanced at about 150% of wall-clock time, while on unaffected nodes it advanced at about 20% of wall-clock time.

All of this made me suspect NIFs and/or ports taking a long time to complete, consuming CPU time without increasing reductions or call counts.

Looking at the various profilers for Erlang, I couldn't find one that can profile NIFs, or even reveal that they do take a long time to finish.

Is there a better way to diagnose a high CPU usage issue?

Cheers,
Eli

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Identifying causes for high CPU usage

Michael Uvarov-2
Hi,

Try the lock counter (lcnt) during load testing:
http://erlang.org/doc/apps/tools/lcnt_chapter.html
Note that it is very unsafe to run in production.
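A typical lcnt session looks roughly like this (a sketch; the emulator must be built with lock counting enabled, which is not the case for a stock VM, and `db_hash_slot` is only an example lock class name):

```erlang
%% Sketch of an lcnt session on a lock-counting emulator build.
lcnt:start(),
lcnt:clear(),      % zero the counters, then apply load for a while
timer:sleep(10000),
lcnt:collect(),    % snapshot counters from the runtime system
lcnt:conflicts(),  % print lock classes with collisions
lcnt:inspect(db_hash_slot).  % drill into one lock class (example name)
```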

Check the locking time of the port drivers. Sometimes setting the ERL_DRV_FLAG_USE_PORT_LOCKING flag can help.

Long operations in ports cause the same scheduling problems you may have heard about with NIFs, unless the operations are asynchronous and the driver is well written.
Scheduling/locking problems cause random latency spikes in all timing metrics, so check the 95th percentile.

Regards,
Michael


On 23 June 2016 at 12:35, Eli Iser <[hidden email]> wrote:

[...]




--
Best regards,
Uvarov Michael


Re: Identifying causes for high CPU usage

Jesper Louis Andersen-2

On Thu, Jun 23, 2016 at 12:35 PM, Eli Iser <[hidden email]> wrote:
* run_queue - affected nodes had a run queue of several dozens (less than 100), while un-affected nodes had 0 (always). Since run_queues is undocumented (at least I didn't see it in the documentation), I didn't run it at the time of the problem.

A queue's load can be seen as a (real) number K.

If K < 1, your system can dequeue messages faster than they arrive. This leads to a queue size of 0 over time.

If K = 1 it means that your system dequeues at the same rate as the arrival rate. This leads to a standing queue.

If K > 1 the queue will slowly fill up because the arrival rate is larger than the processing/dequeue-rate.

In your case, you are either in the K = 1 or the K > 1 situation for a while. This usually leads to more load on the system because there is more work to do. Note, however, that a 100% CPU load isn't necessarily a problem, unless response latencies are also affected. If you start a periodic background job which is CPU bound, this will take up all the free resources, but it will hopefully be scheduled out of the core whenever other work arrives to make way for faster processing.

In other words, you may want to figure out what happens inside the processes with the larger message queues, and what events could lead to the longer message queues. A common case is that a specific user or subsystem triggers the situation through normal use, but that use hits an edge case.
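One cheap way to see which regime a node is in is to sample the run queue over time (a sketch; the one-minute window is arbitrary):

```erlang
%% If the samples hover around a constant non-zero value, K is near 1;
%% if they climb steadily, K > 1; if they are mostly 0, K < 1.
Samples = [begin timer:sleep(1000), erlang:statistics(run_queue) end
           || _ <- lists:seq(1, 60)],
io:format("min=~p max=~p last=~p~n",
          [lists:min(Samples), lists:max(Samples), lists:last(Samples)]).
```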



--
J.


Re: Identifying causes for high CPU usage

Eli Iser
The interesting thing is that I didn't see any processes that have long message queues. During regular operation of the nodes, I rarely see more than 5 messages in the queue of the most loaded processes. During the period of high CPU usage this didn't change. This made me think that this is not a problem with regular Erlang code load, but with something else.


On 23 June 2016 at 15:33, Jesper Louis Andersen <[hidden email]> wrote:

[...]

