Use of spinlocks in Erlang

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Use of spinlocks in Erlang

Andreas Schultz-3
Hi,

There is an interesting discussion going on about the "correct" use of spinlocks in user space applications on Linux on the LKML. That discussion prompted me to check the OTP source on how locks are implemented.
Some of the arguments of the LKML discussion would seem to apply to the spinlocks (and maybe event the mutexes) used in Erlangs ERTS as well.

The main takeaway seem to be this quote from Linus:

I repeat: do not use spinlocks in user space, unless you actually know what you're doing. And be aware that the likelihood that you know what you are doing is basically nil.

Could any of the Erlang gurus or ERTS developers comment on the use of home grown mutexes and spinlocks in Erlang in light of that discussion?
Reply | Threaded
Open this post in threaded view
|

Re: Use of spinlocks in Erlang

Jesper Louis Andersen-2
On Tue, Jan 7, 2020 at 10:01 AM Andreas Schultz <[hidden email]> wrote:

Could any of the Erlang gurus or ERTS developers comment on the use of home grown mutexes and spinlocks in Erlang in light of that discussion?


Not a locking expert in the ERTS, but:

The locking API is there to provide a factoring point for the BEAM. The idea is that we have a locking API to use in ERTS which then maps to different implementations as we see fit. This allows us to target both UNIX, Windows, and embedded OS'es at the same time, with a consistent API. But it gets better than that!

We can use the factoring point to "plug in" different locking models and try them out. This allows us to choose a locking model which works well. It also allows us to plug in various debugging/correctness checks at compile time. For instance that lock ordering is preserved, if locks are contended on, or if a deadlock happened.

That is, the current locking model is sort-of chosen at compile time, based on the capabilities of the underlying system. 

All in all, I'm going to claim that the system is not so much "home grown" as it is a sensible wrapper around an underlying system.


--
J.
Reply | Threaded
Open this post in threaded view
|

Re: Use of spinlocks in Erlang

Andreas Schultz-3


Am Mi., 15. Jan. 2020 um 17:41 Uhr schrieb Jesper Louis Andersen <[hidden email]>:
On Tue, Jan 7, 2020 at 10:01 AM Andreas Schultz <[hidden email]> wrote:

Could any of the Erlang gurus or ERTS developers comment on the use of home grown mutexes and spinlocks in Erlang in light of that discussion?


Not a locking expert in the ERTS, but:

The locking API is there to provide a factoring point for the BEAM. The idea is that we have a locking API to use in ERTS which then maps to different implementations as we see fit. This allows us to target both UNIX, Windows, and embedded OS'es at the same time, with a consistent API. But it gets better than that!

We can use the factoring point to "plug in" different locking models and try them out. This allows us to choose a locking model which works well. It also allows us to plug in various debugging/correctness checks at compile time. For instance that lock ordering is preserved, if locks are contended on, or if a deadlock happened.

That is, the current locking model is sort-of chosen at compile time, based on the capabilities of the underlying system. 

All in all, I'm going to claim that the system is not so much "home grown" as it is a sensible wrapper around an underlying system.

I have seen the different underlying implementations for the locking primitives. There are wrappers for Windows and POSIX threads (pthread). But there also seems to be a completely independent implementation of mutexes and spinlocks that does not use the OS or libc provided primitives.
On Linux that Erlang specific implementation appears to be used by default (that is what I called "home grown"). The pthread version is only used when Erlang is build with valgrind support (or so it seems).

I was thinking about hacking a build to use only the pthread locking primitives (without the other valdrind overhead) and run some benchmarks.
However, my hope was that some Erlang expert could shed some light in the locking before I spent the time to do that.

Anyhow, does anybody have a suggestion for an existing benchmark that could demonstrate differences between locking primitives?

Andreas



--
J.


--

Andreas Schultz

Reply | Threaded
Open this post in threaded view
|

Re: Use of spinlocks in Erlang

Rickard Green-2
On Wed, Jan 15, 2020 at 5:54 PM Andreas Schultz <[hidden email]> wrote:

>
>
>
> Am Mi., 15. Jan. 2020 um 17:41 Uhr schrieb Jesper Louis Andersen <[hidden email]>:
>>
>> On Tue, Jan 7, 2020 at 10:01 AM Andreas Schultz <[hidden email]> wrote:
>>>
>>>
>>> Could any of the Erlang gurus or ERTS developers comment on the use of home grown mutexes and spinlocks in Erlang in light of that discussion?
>>>
>>
>> Not a locking expert in the ERTS, but:
>>
>> The locking API is there to provide a factoring point for the BEAM. The idea is that we have a locking API to use in ERTS which then maps to different implementations as we see fit. This allows us to target both UNIX, Windows, and embedded OS'es at the same time, with a consistent API. But it gets better than that!
>>
>> We can use the factoring point to "plug in" different locking models and try them out. This allows us to choose a locking model which works well. It also allows us to plug in various debugging/correctness checks at compile time. For instance that lock ordering is preserved, if locks are contended on, or if a deadlock happened.
>>
>> That is, the current locking model is sort-of chosen at compile time, based on the capabilities of the underlying system.
>>
>> All in all, I'm going to claim that the system is not so much "home grown" as it is a sensible wrapper around an underlying system.
>
>
> I have seen the different underlying implementations for the locking primitives. There are wrappers for Windows and POSIX threads (pthread). But there also seems to be a completely independent implementation of mutexes and spinlocks that does not use the OS or libc provided primitives.
> On Linux that Erlang specific implementation appears to be used by default (that is what I called "home grown"). The pthread version is only used when Erlang is build with valgrind support (or so it seems).
>
> I was thinking about hacking a build to use only the pthread locking primitives (without the other valdrind overhead) and run some benchmarks.
> However, my hope was that some Erlang expert could shed some light in the locking before I spent the time to do that.
>
> Anyhow, does anybody have a suggestion for an existing benchmark that could demonstrate differences between locking primitives?
>
> Andreas
>
>>
>>
>> --
>> J.
>
>
>
> --
>
> Andreas Schultz


There are more or less no pure spinlocks left in the VM (there might be in some test-code and/or at some initialization phase). The last one at a potentially heavily contended place was removed not long ago though <http://erlang.org/download/OTP-22.1.8.README>.

We have "homegrown" lock implementations that spin a while trying to acquire a lock that is locked. The spinning since it significantly improve performance. The spinning is bounded though and a thread not successfully acquiring a lock will eventually go to sleep.

The two major "homegrown" lock implementations:

-- Process locks --

Our process lock implementation is mainly a memory optimization. Each process (currently) have 5 different locks.The 5 locks of a process consumes 6 words in total. When using pthread mutexes 25 words are used (on my linux machine). Our process lock implementation also makes it cheaper to lock/unlock multiple locks of one processes in one operation. The cost of locking all 5 in one operation is the same as locking one of them.

-- Rwlocks --

This implementation came about since the NPTL *) rwlocks was not fair. By default a reader preferred approach was used by NPTL rwlocks which easily could (and did) starve writers. You could configure them as writer preferred which was a bit better, but the main problem with starvation still remained (in this case of readers). The performance of the NPTL rwlocks was also quite bad in most scenarios. I write "was" since I don't know if changes has been made since the last time I looked at this.

Besides being fair our rwlock implementation also have a reader optimized mode which improve performance in scenarios where you mostly only read-lock the rwlock. This by keeping track of readers in separate cachelines and by this avoiding cacheline ping-ponging. On a big smp machine this improvement can be huge. Such rwlocks are e.g. used when you pass the read_concurrency option to ets tables.

*) Native POSIX Thread Library (used on Linux)


The NPTL rwlocks was quite a disappointment, but NPTL mutexes was the opposite. The performance was really good. Especially in heavily contended cases. On Linux they are used for all "mutex-cases" except for process locks.

Regards,
Rickard
--
Rickard Green, Erlang/OTP, Ericsson AB
Reply | Threaded
Open this post in threaded view
|

Re: Use of spinlocks in Erlang

Jaka Bac
Since this question has popped up, I looked at how the ethread part of locking is written for Windows (as opposed to Linux). The waits here also seem to spin potentially (on both Windows and POSIX implementation)
Linux implementation is using futexes while standard Windows events are used on Windows.

The question is what is the earliest version of Windows that should be officially supported since starting with Windows 8, there is a mechanism similar to futexes available (and everything which does not have this feature is pretty much EOL'ed from MS now)

I know that Windows port of Erlang is not a priority, but would it make sense to update the thread events to use WaitOnAddress/WakeByAddressXXX?
Before I do any experiments by myself I would like to know if the Erlang/OTP gurus see a potential gain here.

Thanks,
Jaka
 

On Wed, 15 Jan 2020 at 23:03, Rickard Green <[hidden email]> wrote:
On Wed, Jan 15, 2020 at 5:54 PM Andreas Schultz <[hidden email]> wrote:

>
>
>
> Am Mi., 15. Jan. 2020 um 17:41 Uhr schrieb Jesper Louis Andersen <[hidden email]>:
>>
>> On Tue, Jan 7, 2020 at 10:01 AM Andreas Schultz <[hidden email]> wrote:
>>>
>>>
>>> Could any of the Erlang gurus or ERTS developers comment on the use of home grown mutexes and spinlocks in Erlang in light of that discussion?
>>>
>>
>> Not a locking expert in the ERTS, but:
>>
>> The locking API is there to provide a factoring point for the BEAM. The idea is that we have a locking API to use in ERTS which then maps to different implementations as we see fit. This allows us to target both UNIX, Windows, and embedded OS'es at the same time, with a consistent API. But it gets better than that!
>>
>> We can use the factoring point to "plug in" different locking models and try them out. This allows us to choose a locking model which works well. It also allows us to plug in various debugging/correctness checks at compile time. For instance that lock ordering is preserved, if locks are contended on, or if a deadlock happened.
>>
>> That is, the current locking model is sort-of chosen at compile time, based on the capabilities of the underlying system.
>>
>> All in all, I'm going to claim that the system is not so much "home grown" as it is a sensible wrapper around an underlying system.
>
>
> I have seen the different underlying implementations for the locking primitives. There are wrappers for Windows and POSIX threads (pthread). But there also seems to be a completely independent implementation of mutexes and spinlocks that does not use the OS or libc provided primitives.
> On Linux that Erlang specific implementation appears to be used by default (that is what I called "home grown"). The pthread version is only used when Erlang is build with valgrind support (or so it seems).
>
> I was thinking about hacking a build to use only the pthread locking primitives (without the other valdrind overhead) and run some benchmarks.
> However, my hope was that some Erlang expert could shed some light in the locking before I spent the time to do that.
>
> Anyhow, does anybody have a suggestion for an existing benchmark that could demonstrate differences between locking primitives?
>
> Andreas
>
>>
>>
>> --
>> J.
>
>
>
> --
>
> Andreas Schultz


There are more or less no pure spinlocks left in the VM (there might be in some test-code and/or at some initialization phase). The last one at a potentially heavily contended place was removed not long ago though <http://erlang.org/download/OTP-22.1.8.README>.

We have "homegrown" lock implementations that spin a while trying to acquire a lock that is locked. The spinning since it significantly improve performance. The spinning is bounded though and a thread not successfully acquiring a lock will eventually go to sleep.

The two major "homegrown" lock implementations:

-- Process locks --

Our process lock implementation is mainly a memory optimization. Each process (currently) have 5 different locks.The 5 locks of a process consumes 6 words in total. When using pthread mutexes 25 words are used (on my linux machine). Our process lock implementation also makes it cheaper to lock/unlock multiple locks of one processes in one operation. The cost of locking all 5 in one operation is the same as locking one of them.

-- Rwlocks --

This implementation came about since the NPTL *) rwlocks was not fair. By default a reader preferred approach was used by NPTL rwlocks which easily could (and did) starve writers. You could configure them as writer preferred which was a bit better, but the main problem with starvation still remained (in this case of readers). The performance of the NPTL rwlocks was also quite bad in most scenarios. I write "was" since I don't know if changes has been made since the last time I looked at this.

Besides being fair our rwlock implementation also have a reader optimized mode which improve performance in scenarios where you mostly only read-lock the rwlock. This by keeping track of readers in separate cachelines and by this avoiding cacheline ping-ponging. On a big smp machine this improvement can be huge. Such rwlocks are e.g. used when you pass the read_concurrency option to ets tables.

*) Native POSIX Thread Library (used on Linux)


The NPTL rwlocks was quite a disappointment, but NPTL mutexes was the opposite. The performance was really good. Especially in heavily contended cases. On Linux they are used for all "mutex-cases" except for process locks.

Regards,
Rickard
--
Rickard Green, Erlang/OTP, Ericsson AB