Diagnosing performance issue in Discord usage of Erlang on Google Cloud


Diagnosing performance issue in Discord usage of Erlang on Google Cloud

Mark Smith
Hello,

I'm an SRE at Discord. We make heavy use of Elixir/Erlang in production and it works really well for us. Lately we've been seeing an interesting performance issue with our VMs; we've spent quite a few hours trying to understand it, but haven't found a root cause yet.

The top-line summary: weighted scheduler utilization rises from 30% to 80% over a 1-2 minute span, while user CPU usage drops from 70% to 30%. This state lasts approximately 5-30 minutes and then recovers just as quickly as it began.
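For reference, one way to sample scheduler utilization (a rough sketch, not our exact instrumentation) is to diff two erlang:statistics(scheduler_wall_time) readings:

    %% Rough sketch: per-scheduler utilization over an interval.
    scheduler_utilization(IntervalMs) ->
        erlang:system_flag(scheduler_wall_time, true),
        S0 = lists:sort(erlang:statistics(scheduler_wall_time)),
        timer:sleep(IntervalMs),
        S1 = lists:sort(erlang:statistics(scheduler_wall_time)),
        [{Id, (A1 - A0) / (T1 - T0)}
         || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(S0, S1)].

Newer OTP releases also ship scheduler:utilization/1 in runtime_tools, which wraps this up and includes a weighted total.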


We've managed to isolate the behavior to something related to memory bandwidth/latency: during the periods when we see the above impact, malloc/mmap latency (as measured by eBPF tracing) increases by 5-10x. This gives us two leading theories for the cause:

a) Noisy neighbors. We're in a cloud environment, so it's possible that another tenant on the shared physical host is saturating memory bandwidth and causing this impact. We've been working with Google Cloud to investigate this branch.

b) One of our Erlang processes is allocating in an anti-pattern and causing the memory bandwidth saturation. I.e., we're shooting ourselves in the foot.

We are trying to root-cause this because, if it's the second situation, we'd of course like to understand and fix the problem. We've added a fair bit of instrumentation and stats recording based on recon_alloc, so we do have some pointers, but none of us are knowledgeable enough about the BEAM VM to know where to go next.
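As a rough illustration of the kind of recon_alloc sampling involved (a sketch only, not our actual instrumentation):

    %% Illustrative sample of allocator-level stats via recon_alloc.
    sample_allocators() ->
        #{allocated     => recon_alloc:memory(allocated, current),
          used          => recon_alloc:memory(used, current),
          fragmentation => recon_alloc:fragmentation(current),
          sbcs_to_mbcs  => recon_alloc:sbcs_to_mbcs(current)}.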

Some interesting things we have determined:


* temp_alloc allocations drop quite a bit, sl_alloc usage rises, and the two become mirrors of each other.
* fix_alloc rises somewhat during the bad periods.
* other allocators seem unaffected.

sl_alloc feels like an interesting culprit. The number of MBCs in sl_alloc is ~500 in the good periods, but during the bad periods it climbs to 15,000. However, the number of mseg_alloc calls (from sl_alloc) drops to nearly 0, which doesn't seem to make sense to me.



I have no idea whether this is cause or effect, so my basic question is: how do we determine that, and what would people suggest we do next with the investigation?

I'm available on IRC as 'zorkian' and on Discord as zorkian#0001, if anybody wants to follow up more immediately or ask for more data.

Thank you for reading!

--
Mark Smith
SRE @ Discord

Re: Diagnosing performance issue in Discord usage of Erlang on Google Cloud

Andy Till

Hi Mark

I have seen the high-scheduler/low-CPU pattern before, on a system that had lock contention on ETS tables. I was able to verify this was the case using lcnt.
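A rough sketch of that kind of lcnt session (it needs an emulator built with lock counting enabled; illustrative only):

    lcnt:start().
    lcnt:clear().
    %% ... let the bad period run for a while ...
    lcnt:collect().
    lcnt:conflicts().    %% lists the most contended locks; contended ETS table locks show up here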

Cheers
Andy

Re: Diagnosing performance issue in Discord usage of Erlang on Google Cloud

Max Lapshin
Sometimes we can find something interesting with a small script that sorts processes by scheduler time:
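As a minimal sketch of the idea (not the original script), ranking processes by reductions done over a sampling interval as a rough proxy for scheduler time:

    %% Illustrative only: rank processes by reductions done over an interval.
    top_by_reductions(N, IntervalMs) ->
        Before = [{P, R} || P <- erlang:processes(),
                            {reductions, R} <- [erlang:process_info(P, reductions)]],
        timer:sleep(IntervalMs),
        Deltas = [{P, R1 - R0}
                  || {P, R0} <- Before,
                     {reductions, R1} <- [erlang:process_info(P, reductions)]],
        lists:sublist(lists:reverse(lists:keysort(2, Deltas)), N).

recon:proc_window(reductions, N, IntervalMs) from the recon library does essentially the same thing.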


Re: Diagnosing performance issue in Discord usage of Erlang on Google Cloud

Lukas Larsson
Hello,

It would be useful to know which Erlang/OTP version you are using, and which (if any) emulator options you are using.

I would start by trying to find out what allocations are taking place in sl_alloc.

Can you run perf in production? By sampling the call stack at mmap/malloc calls (or just at some time interval), it should be possible to pinpoint which allocation is causing the problems.

Do you notice any difference in the erlang:memory/0 stats during this period? sl_alloc can allocate in most of the erlang:memory categories, so if one category rises more than the others, it could help pinpoint what the allocations are. The mappings can be found here: https://github.com/erlang/otp/blob/master/erts/emulator/beam/erl_alloc.types. My guess would be that the allocations end up in the system category, which does not tell us all that much, but at least it would rule out processes.
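As an illustrative sketch, one way to see which category moves is to snapshot erlang:memory/0 twice and sort the categories by how much they grew:

    %% Illustrative helper: how much each erlang:memory/0 category grew
    %% over an interval, biggest growth first.
    memory_delta(IntervalMs) ->
        Before = erlang:memory(),
        timer:sleep(IntervalMs),
        After = erlang:memory(),
        lists:reverse(lists:keysort(2,
            [{Key, proplists:get_value(Key, After, 0) - Bytes}
             || {Key, Bytes} <- Before])).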


On Sat, Feb 17, 2018 at 11:17 PM, Mark Smith <[hidden email]> wrote:

> * temp_alloc allocations drop quite a bit, sl_alloc usage rises, and the two become mirrors of each other.

Does free also have the same pattern?
 
> * fix_alloc rises somewhat during the bad periods.
> * other allocators seem unaffected.
>
> sl_alloc feels like an interesting culprit. The number of MBCs in sl_alloc is ~500 in the good periods, but during the bad periods it climbs to 15,000. However, the number of mseg_alloc calls (from sl_alloc) drops to nearly 0, which doesn't seem to make sense to me.

First, a bit of naming: what you call MBCs, I would call MBC blocks. The number of MBCs would be what you get from mbcs->carriers.

The allocations could be done from the mbcs pool, which will not show up in the call statistics for mseg or sys. So it would be interesting to look at mbcs->carriers and mbcs_pool->carriers to see if the allocations come from there.
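A sketch of pulling those numbers out per sl_alloc instance (illustrative only; the exact tuple layouts vary a bit between OTP releases):

    %% Current carrier counts per sl_alloc instance: main multiblock
    %% carriers and the mbcs pool (the pool entry may be absent on older OTP).
    sl_carriers() ->
        [{Instance,
          carrier_count(proplists:get_value(mbcs, Info, [])),
          carrier_count(proplists:get_value(mbcs_pool, Info, []))}
         || {instance, Instance, Info} <- erlang:system_info({allocator, sl_alloc})].

    carrier_count(Props) ->
        case lists:keyfind(carriers, 1, Props) of
            false -> 0;
            Tuple -> element(2, Tuple)   % current count comes first
        end.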
 

> I have no idea whether this is cause or effect, so my basic question is: how do we determine that, and what would people suggest we do next with the investigation?

My wild guess: for some reason some syscalls/libc calls (maybe mmap) start being slower. This causes the system to start building internal queues. The VM's internal queues are usually allocated using sl_alloc, so you see an increase in the number of sl_alloc blocks. When you are doing work, temp_alloc is used for a lot of that work, and thus it decreases when the system is blocking in a syscall. The reason sl_alloc and temp_alloc end up in lock step is that each of the queued jobs results in one temp allocation.

One place that comes to mind where this could happen is if a port gets stuck in writev when sending data on a socket. Each port operation speculatively does a temp alloc for the data to send, and then, if the port is busy, it does an sl alloc to schedule the job for later. If the port is not busy, only a temp alloc is done and no sl alloc. TCP ports use non-blocking I/O, but I'm not sure what the guarantees of that are in different cloud environments.
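One illustrative way to check for that (a sketch, not a definitive diagnostic): look at how many bytes are sitting in port driver queues, and whether any processes are suspended on busy ports:

    %% Ports with data queued in the driver (waiting to be written out),
    %% largest queue first.
    port_queues() ->
        lists:reverse(lists:keysort(2,
            [{P, Q} || P <- erlang:ports(),
                       {queue_size, Q} <- [erlang:port_info(P, queue_size)],
                       Q > 0])).

    %% Processes currently suspended, e.g. on a busy port.
    suspended_processes() ->
        [P || P <- erlang:processes(),
              erlang:process_info(P, status) =:= {status, suspended}].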

Lukas
