How can we increase multiblock carrier utilization for binary_alloc?


How can we increase multiblock carrier utilization for binary_alloc?

Gerhard Lazu
We are running on Erlang 20.3.4 on Linux 4.4.0-119-generic 14.04.1-Ubuntu SMP x86_64. These are the flags that we use for beam.smp:

/var/vcap/packages/erlang-20.3.4/lib/erlang/erts-9.3/bin/beam.smp -W w -A 64 -zdbbl 128000 -K true -stbt db -zdbbl 128000 -P 1048576 -t 5000000 -MHas ageffcbf -MBas ageffcbf -MHlmbcs 512 -MBlmbcs 512

When the node boots, the binary_alloc multiblock carrier utilization is ~99% (90MB allocated, 90MB used). [1]

As the load on the node starts, binary_alloc mbcs util drops to ~78% (~800MB allocated, ~625MB used). [2]

When load stops completely, binary_alloc mbcs util drops to ~61%  (~55MB allocated, ~35MB used) & mbcs_pool goes to ~36% (~215MB allocated, ~80MB used). In the context of RabbitMQ, this is a big problem since memory usage controls whether incoming messages are blocked or not (a.k.a. memory alarm). It's essential that the Erlang VM utilises memory as efficiently as possible, otherwise nodes under no load can remain blocked permanently. [3]

Our goal is for the Erlang VM to have as little unused memory as possible. As you can see in the referenced screenshots [1][2][3], the total unused memory starts at ~30MB and grows to ~280MB. Considering that the total RSS memory used by the beam.smp process is 545MB [3], half of it (~280MB) goes unused, and this is a big problem for RabbitMQ, as mentioned above.

Lukas, you've shared some excellent documentation in the past around the topic of memory management in Erlang. I am wondering if you have deeper/more refined insights that could help our current challenge. In Erlang Memory Management Battle Stories [5], on slide 29, you mention "Decreasing largest mbc size will make more carriers and hopefully be able to free them". In our case, it doesn't, and I'm hoping that you can point us in the right direction.

Ferd, I've studied your past battles with Erlang's memory management, and I can only thank you for sharing so much over the years. Erlang in Anger [6] and recon [7] helped immensely, thank you. I would appreciate greatly if you could nudge us in the right direction, maybe we've missed something.

During this exploration, there's a specific thing that's been bugging us: why is RSS smaller than the allocated memory? 

Thank you all, Gerhard & Loïc.



Re: How can we increase multiblock carrier utilization for binary_alloc?

Lukas Larsson-8
Hello,

On Tue, May 1, 2018 at 3:05 PM, Gerhard Lazu <[hidden email]> wrote:
We are running on Erlang 20.3.4 on Linux 4.4.0-119-generic 14.04.1-Ubuntu SMP x86_64. These are the flags that we use for beam.smp:

/var/vcap/packages/erlang-20.3.4/lib/erlang/erts-9.3/bin/beam.smp -W w -A 64 -zdbbl 128000 -K true -stbt db -zdbbl 128000 -P 1048576 -t 5000000 -MHas ageffcbf -MBas ageffcbf -MHlmbcs 512 -MBlmbcs 512

When the node boots, the binary_alloc multiblock carrier utilization is ~99% (90MB allocated, 90MB used). [1]

As the load on the node starts, binary_alloc mbcs util drops to ~78% (~800MB allocated, ~625MB used). [2]

When load stops completely, binary_alloc mbcs util drops to ~61%  (~55MB allocated, ~35MB used) & mbcs_pool goes to ~36% (~215MB allocated, ~80MB used). 

In the context of RabbitMQ, this is a big problem since memory usage controls whether incoming messages are blocked or not (a.k.a. memory alarm). It's essential that the Erlang VM utilises memory as efficiently as possible, otherwise nodes under no load can remain blocked permanently. [3]

Our goal is for the Erlang VM to have as little unused memory as possible. As you can see in the referenced screenshots [1][2][3], the total unused memory starts at ~30MB and grows to ~280MB. Considering that the total RSS memory used by the beam.smp process is 545MB [3], half of it (~280MB) goes unused, and this is a big problem for RabbitMQ, as mentioned above.

Why do you not use erlang:memory() as the base for whether you can accept more messages? Having a low memory utilisation is not bad in itself, unless of course some other program on the same machine needs the memory.
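As a rough sketch of what I mean (the threshold and the return values here are made up, just to illustrate basing the decision on erlang:memory/1):

    %% Sketch: gate message intake on what Erlang has actually allocated for
    %% its own data, rather than on OS-level RSS. ThresholdBytes is whatever
    %% high watermark you already use.
    check_memory_alarm(ThresholdBytes) ->
        case erlang:memory(total) > ThresholdBytes of
            true  -> block_publishers;   %% raise the memory alarm
            false -> accept_publishers
        end.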

Looking at the used memory under load and after, at peak the allocated memory is 1510 MB and then after it is 577 MB. So about 2/3rds of the allocated memory was returned to the OS. While this is not perfect, it is not terrible either. Reducing it further may not be easy.
 

Lukas, you've shared some excellent documentation in the past around the topic of memory management in Erlang. I am wondering if you have deeper/more refined insights that could help our current challenge. In Erlang Memory Management Battle Stories [5], on slide 29, you mention "Decreasing largest mbc size will make more carriers and hopefully be able to free them". In our case, it doesn't, and I'm hoping that you can point us in the right direction.

Can you reproduce the behaviour? Would it be possible to get a recon_alloc snapshot during and after load?
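Something like this on the node, during load and then again after it has stopped, would be enough (untested sketch; the file name is just an example):

    %% Record the current allocator state and write it to disk so it can be
    %% loaded and inspected later with recon_alloc:snapshot_load/1.
    recon_alloc:snapshot(),
    recon_alloc:snapshot_save("binary_alloc_during_load.snapshot").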

Have you seen https://github.com/erlang/otp/pull/1790 that was just merged to master with the accompanying blog post: http://blog.erlang.org/Memory-instrumentation-in-OTP-21/?

I assume that you have tried ageffcaobf?
 

During this exploration, there's a specific thing that's been bugging us: why is RSS smaller than the allocated memory? 

Every time I try to understand how RSS works I just end up getting more confused.
 


Re: How can we increase multiblock carrier utilization for binary_alloc?

Jesper Louis Andersen-2
In reply to this post by Gerhard Lazu
On Tue, May 1, 2018 at 3:06 PM Gerhard Lazu <[hidden email]> wrote:

During this exploration, there's a specific thing that's been bugging us: why is RSS smaller than the allocated memory? 


This seems fairly obvious to me, but perhaps I am missing something. The Erlang system has allocated memory from the kernel, but the kernel has not yet handed that memory out to the process, and hence it is not in the RSS (Resident Set Size). As you hit new pages, there should be kernel traps, a page is allocated to the process (bumping RSS) and the program is resumed. If you allocate a larger carrier the "inner parts" of it might not be allocated before the first access.

The other situation is that you have excessive memory pressure and the system starts evicting pages that can be removed (they either bear no data, have already been written to the page/swap file on disk, contain libraries, or madvise(2) has been called to tell the OS that the pages are not needed[0]).

In general, most systems with a GC don't give memory back to the OS straight away. They either keep it indefinitely (perhaps with an madvise(2) on the areas they don't need), or they have a reaper that later gives back fully unused "spans" of memory (Go does this, for instance, but it isn't instant).
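If you want to see the gap for yourselves, something along these lines (an untested sketch; assumes Linux /proc and that recon is available) compares the OS view of the process with the allocator view:

    %% Compare resident memory (what the kernel has actually paged in) with
    %% what the Erlang allocators have reserved and what they actually use.
    rss_vs_allocated() ->
        {ok, Status} = file:read_file("/proc/self/status"),
        Lines = binary:split(Status, <<"\n">>, [global]),
        %% The line of interest looks like <<"VmRSS:\t  545124 kB">>
        [RssLine | _] = [L || L <- Lines, binary:match(L, <<"VmRSS:">>) =:= {0, 6}],
        {match, [Kb]} = re:run(RssLine, "[0-9]+", [{capture, first, list}]),
        RssMb   = list_to_integer(Kb) div 1024,
        AllocMb = recon_alloc:memory(allocated) div (1024 * 1024),
        UsedMb  = recon_alloc:memory(used) div (1024 * 1024),
        io:format("rss: ~pMB  allocated: ~pMB  used: ~pMB~n",
                  [RssMb, AllocMb, UsedMb]).

On a node that has reserved carriers but not yet touched all of their pages, rss will trail allocated, which is exactly the effect you are asking about.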


[0] Be *very* cautious with madvise(2), since its implementation semantics differ slightly between Linux/BSD/Illumos, especially around DONTNEED/FREE and friends. In particular: if I don't need a page right now, is the operating system allowed to hand me a zeroed page later when I touch that memory again? Bryan Cantrill has some interesting pointers w.r.t. Lx-branded Illumos Zones and this :)



Re: How can we increase multiblock carrier utilization for binary_alloc?

Gerhard Lazu
In reply to this post by Lukas Larsson-8
Hi Lukas,

Why do you not use erlang:memory() as the base for whether you can accept more messages? Having a low memory utilisation is not bad in itself, unless of course some other program on the same machine needs the memory.

We used to use erlang:memory(), but we've learned that it doesn't work well in practice [1]. Linux OOM will take action based on RSS, not Erlang allocated memory.
 
Looking at the used memory under load and after, at peak the allocated memory is 1510 MB and then after it is 577 MB. So about 2/3rds of the allocated memory was returned to the OS. While this is not perfect, it is not terrible either. Reducing it further may not be easy.

Your observation is accurate. It's equally true that out of the 577MB allocated, 300MB is used and 277MB is unused, meaning that almost half of the allocated memory goes unused.

I understand that it may not be easy to reduce the unused memory, but while it might seem small in this particular scenario, what happens when the Erlang VM has 60GB allocated?

Would it help if we can show the impact of this behaviour on hosts with larger memory usage?
 
Can you reproduce the behaviour? Would it be possible to get a recon_alloc snapshot during and after load?

I'm sharing recon_alloc snapshots during & after load for the following:

1. erts_alloc defaults (lmbcs 5120) [2]
2. -MBlmbcs 512 [3]
 
I've also captured during & after load screenshots of the 2 configurations running side-by-side (left is erts_alloc defaults (lmbcs 5120), right is -MBlmbcs 512) [4].

While our initial configuration used a few more flags, -MHas ageffcbf -MBas ageffcbf -MHlmbcs 512 -MBlmbcs 512, I've kept things as simple as possible on this run and only used -MBlmbcs 512.

Have you seen https://github.com/erlang/otp/pull/1790 that was just merged to master with the accompanying blog post: http://blog.erlang.org/Memory-instrumentation-in-OTP-21/?

I haven't, thank you for sharing. We are waiting on Elixir #6611 before we can test against OTP 21.0-rc1 [5].
 
I assume that you have tried ageffcaobf?

Yes, we have tried all allocation strategies. ageffcbf resulted in "spikier" CPU and dirty mem writeback, but also sharper drops in dirty mem writeback. Under load, ageffcbf had 1% lower RSS usage, and 2.5% lower unused memory than ageffcaobf. After load however, ageffcbf had 5% lower RSS usage & 4% lower unused memory. In conclusion, ageffcbf proved the best out of all allocation strategies.

Here is a side-by-side comparison of -MBas ageffcaobf -MBlmbcs 512 (left) vs -MBas ageffcbf -MBlmbcs 512 (right) [6], and the relevant recon_alloc snapshots [7].

Thank you Lukas for helping out with this, Gerhard.




Re: How can we increase multiblock carrier utilization for binary_alloc?

Gerhard Lazu
In reply to this post by Jesper Louis Andersen-2
Hi Jesper,

This is a great explanation, I was able to connect the missing dots.

Thank you very much, Gerhard.


Re: How can we increase multiblock carrier utilization for binary_alloc?

Lukas Larsson-8
In reply to this post by Gerhard Lazu


On Thu, May 3, 2018 at 12:18 PM, Gerhard Lazu <[hidden email]> wrote:
Hi Lukas,

Why do you not use erlang:memory() as the base for whether you can accept more messages? Having a low memory utilisation is not bad in itself, unless of course some other program on the same machine needs the memory.

We used to use erlang:memory(), but we've learned that it doesn't work well in practice [1]. Linux OOM will take action based on RSS, not Erlang allocated memory.

Yes, looking at erlang:memory() could make you end up in those scenarios. However, it should be possible to look at recon_alloc:memory(unused) to get a ballpark figure for how much memory the Erlang VM has reserved but is not using at the moment.
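For example (recon_alloc reports bytes by default):

    %% Ballpark of memory the VM has reserved from the OS but is not using.
    Unused = recon_alloc:memory(unused),
    io:format("unused: ~p MB~n", [Unused div (1024 * 1024)]).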
 
 
Looking at the used memory under load and after, at peak the allocated memory is 1510 MB and then after it is 577 MB. So about 2/3rds of the allocated memory was returned to the OS. While this is not perfect, it is not terrible either. Reducing it further may not be easy.

Your observation is accurate. It's equally true that out of the 577MB allocated, 300MB is used and 277MB is unused, meaning that almost half of the allocated memory goes unused.

I understand that it may not be easy to reduce the unused memory, but while it might seem small in this particular scenario, what happens when the Erlang VM has 60GB allocated?

One thing that you could try is to see whether malloc does a better job than erts does. In general the erts allocators are better at scalability, while malloc is better at performance; I don't know which is better at dealing with fragmentation. Use "+MBe false" to disable the erts allocator for binary_alloc. The really bad part about this is that you lose all statistics, so if you do run into other issues it will be much harder to figure out what is going on.
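Concretely, that should look something like:

    erl +MBe false

(when invoking beam.smp directly, as in your command line above, the flag should show up as -MBe false, just like your other -M flags).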

You could also try to play with the sbct to see if you get better allocations using a lower value. This will cause more allocations to be placed into sbcs, which could be good for fragmentation. You seem to have an average block size of about 2 kb, so I would try setting the sbct to about double that to start with and see if you notice any difference. It's hard to know what a good value would be with the OTP-20 instrumentation, so you will have to experiment and see if anything changes.
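So as a first attempt, something like:

    erl +MBsbct 4

(if I read the erts_alloc docs right, the sbct value is in kilobytes, so 4 is roughly double your ~2 kb average block size; blocks larger than that would then go into their own singleblock carriers).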
 
Would it help if we can show the impact of this behaviour on hosts with larger memory usage?

No, I don't think so. The behaviour should be similar, just with scaled-up values. I suppose it would still be a good idea to verify that if you can.
 
 
Can you reproduce the behaviour? Would it be possible to get a recon_alloc snapshot during and after load?

I'm sharing recon_alloc snapshots during & after load for the following:

1. erts_alloc defaults (lmbcs 5120) [2]
2. -MBlmbcs 512 [3]
 
I've also captured during & after load screenshots of the 2 configurations running side-by-side (left is erts_alloc defaults (lmbcs 5120), right is -MBlmbcs 512) [4].

While our initial configuration used a few more flags, -MHas ageffcbf -MBas ageffcbf -MHlmbcs 512 -MBlmbcs 512, I've kept things as simple as possible on this run and only used -MBlmbcs 512.

Have you seen https://github.com/erlang/otp/pull/1790 that was just merged to master with the accompanying blog post: http://blog.erlang.org/Memory-instrumentation-in-OTP-21/?

I haven't, thank you for sharing. We are waiting on Elixir #6611 before we can test against OTP 21.0-rc1 [5].
 
I assume that you have tried ageffcaobf?

Yes, we have tried all allocation strategies. ageffcbf resulted in "spikier" CPU and dirty mem writeback, but also sharper drops in dirty mem writeback. Under load, ageffcbf had 1% lower RSS usage, and 2.5% lower unused memory than ageffcaobf. After load however, ageffcbf had 5% lower RSS usage & 4% lower unused memory. In conclusion, ageffcbf proved the best out of all allocation strategies.
 
Here is a side-by-side comparison of -MBas ageffcaobf -MBlmbcs 512 (left) vs -MBas ageffcbf -MBlmbcs 512 (right) [6], and the relevant recon_alloc snapshots [7].

While aobf should be a little bit better, I suspect that since we introduced the carrier pool and thus use the carrier strategies, the difference is within the error margins, especially when you run a system with lots of small carriers.

On a related point, this discussion has prompted me to start looking at using madvise/VirtualAlloc to let the OS know that pages within carriers are no longer used by erts. I'm not sure how that will interact with RSS; from what I've been able to figure out, the pages remain associated with the program until the OS needs them to satisfy some other memory request. Any such feature won't be in OTP-21, but may be added later.
 
Thank you Lukas for helping out with this, Gerhard.


