Changing the representation of sets

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Changing the representation of sets

Michał Muskała
Hello everybody.

The sets module implements an opaque structure to represent sets. The current implementation is based on hashes and nested tuples. Since OTP 18, we have maps that are an excellent and performant data structure for storing arbitrary keys & values. This means, they can be used to efficiently implement sets, by storing set elements as keys with some default value. 

My measurements [1] show that a set implementation based on maps is 2-4 times faster in all operations than the current implementation from the sets module. Comparing performance tradeoffs to other set implementations, it seems that map-based sets have exactly the same characteristics as current sets, except they are faster. Map-based sets also have a huge advantage of being easily readable when printed in the console. Given the sets data structure is opaque it should be possible to just switch implementations and make everybody's code faster without changing a line of code. The only issue might be incorrect uses of the sets module that assume some particular shape for the structure - I'm not sure how widespread that is.

Another possibility would be to implement a new map_sets module, but I think 4 set implementation is already too many - no need to add another. There's also no need to write a new implementation since there's already a well tested, map-based set implementation available in OTP in the cerl_sets module in the compiler application [2].

A similar consideration could be made for the dict module, but I haven't looked closely at the performance differences there - I would expect something similar, though, since sets and dict use a similar implementation.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Michael Truog
On 04/24/2017 11:51 PM, Michał Muskała wrote:
Hello everybody.

The sets module implements an opaque structure to represent sets. The current implementation is based on hashes and nested tuples. Since OTP 18, we have maps that are an excellent and performant data structure for storing arbitrary keys & values. This means, they can be used to efficiently implement sets, by storing set elements as keys with some default value. 

My measurements [1] show that a set implementation based on maps is 2-4 times faster in all operations than the current implementation from the sets module. Comparing performance tradeoffs to other set implementations, it seems that map-based sets have exactly the same characteristics as current sets, except they are faster. Map-based sets also have a huge advantage of being easily readable when printed in the console. Given the sets data structure is opaque it should be possible to just switch implementations and make everybody's code faster without changing a line of code. The only issue might be incorrect uses of the sets module that assume some particular shape for the structure - I'm not sure how widespread that is.

Another possibility would be to implement a new map_sets module, but I think 4 set implementation is already too many - no need to add another. There's also no need to write a new implementation since there's already a well tested, map-based set implementation available in OTP in the cerl_sets module in the compiler application [2].

A similar consideration could be made for the dict module, but I haven't looked closely at the performance differences there - I would expect something similar, though, since sets and dict use a similar implementation.


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

It is best to avoid modifying dict and sets, due to unintended consequences.  One I can think of is the impact on memory allocation when a dict or sets is sent through distributed Erlang, since the memory allocation for a map larger than 32 elements (MAP_SMALL_MAP_LIMIT) gets an allocation from HASHMAP_ESTIMATED_HEAP_SIZE(SIZE) == (SIZE*3 + (2*SIZE/5)*2) which can severely exaggerate the size of the map (details in erts/emulator/beam/erl_map.h).  There may be other impacts also, but that is unclear.  Keeping maps and dict/sets separate should be safest to avoid unintended changes in legacy source code.

Best Regards,
Michael


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Attila Rajmund Nohl
In reply to this post by Michał Muskała
2017-04-25 8:51 GMT+02:00 Michał Muskała <[hidden email]>:
[...]

> My measurements [1] show that a set implementation based on maps is 2-4
> times faster in all operations than the current implementation from the sets
> module. Comparing performance tradeoffs to other set implementations, it
> seems that map-based sets have exactly the same characteristics as current
> sets, except they are faster. Map-based sets also have a huge advantage of
> being easily readable when printed in the console. Given the sets data
> structure is opaque it should be possible to just switch implementations and
> make everybody's code faster without changing a line of code. The only issue
> might be incorrect uses of the sets module that assume some particular shape
> for the structure - I'm not sure how widespread that is.

It would affect everybody who saves the opaque data using an earlier
OTP version, then reads it in newer OTP version (e.g. after upgrade).
Or those who run two nodes on different OTP versions (e.g. during an
upgrade).
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Jesper Louis Andersen-2
On Tue, Apr 25, 2017 at 1:17 PM Attila Rajmund Nohl <[hidden email]> wrote:
It would affect everybody who saves the opaque data using an earlier
OTP version, then reads it in newer OTP version (e.g. after upgrade).
Or those who run two nodes on different OTP versions (e.g. during an
upgrade).

I like to think of dict and sets to be "opaque" data structures. If you are relying on their internal representation, you will run into trouble. So changing the representation going forward should definitely be possible.

This leaves backwards compatibility. If the maps-optimized module can read and dynamically change old sets implementations into the new format on the fly, it may be possible to gradually replace the old representations with the new.

Some things to look out for in that process is data-at-rest, stored years ago in a database. At some point you would have to reprocess such data, or supply a conversion module which can handle these old formats.

I also think that any system must provide some measure of pushing things forward. That is, each major release of Erlang could contain a limited set of things you now have to do differently, with a clear upgrade path from earlier versions. As long as the set is limited, we can probably handle the rewrites needed. If you value backwards compatibility for forever, you run the risk of getting stale, never upgrading anything.

As for the increased memory copy pressure: I think this should be fixed in the context of "maps" and not be part of the argument as to why one would keep the old sets representation.

As an aside: I've long wanted a way to "tag" an erlang term as "do-not-touch-this" That, is to provide functions:

seal(Tag, Term) -> Sealed
unseal(Tag, Sealed) -> Term

which "hides" the representation of Term if printed, replacing it with some kind of opaque representation (Think a function). For debugging purposes, an auto-unseal could be necessary.

One reason for this is that I can make a representation abstract so users of a library are unlikely to rely on the representation being a certain structure and by accident tightly coupling their code to my code. While we are all mutually consenting adults, experience has shown people tend to rely on internals quite often.


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Marco Molteni
On 25 Apr 2017, at 15:17, Jesper Louis Andersen <[hidden email]> wrote:

[..]

> I also think that any system must provide some measure of pushing things forward. That is, each major release of Erlang could contain a limited set of things you now have to do differently, with a clear upgrade path from earlier versions. As long as the set is limited, we can probably handle the rewrites needed. If you value backwards compatibility for forever, you run the risk of getting stale, never upgrading anything.

AMEN.

Marco

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Michał Muskała
In reply to this post by Michael Truog
I checked how the memory sizes compare. 

Map-based sets are indeed larger - about 70% larger in memory than current sets, however they are smaller when serialised with term_to_binary (assuming using [] for the bogus value) by about 5%. That is when storing plain integers - when storing bigger terms, the difference quickly melts.

Maps are smaller than dict - by about 12% in memory and 40% when serialised.

Michał.

On 25 Apr 2017, 09:18 +0200, Michael Truog <[hidden email]>, wrote:
On 04/24/2017 11:51 PM, Michał Muskała wrote:
Hello everybody.

The sets module implements an opaque structure to represent sets. The current implementation is based on hashes and nested tuples. Since OTP 18, we have maps that are an excellent and performant data structure for storing arbitrary keys & values. This means, they can be used to efficiently implement sets, by storing set elements as keys with some default value. 

My measurements [1] show that a set implementation based on maps is 2-4 times faster in all operations than the current implementation from the sets module. Comparing performance tradeoffs to other set implementations, it seems that map-based sets have exactly the same characteristics as current sets, except they are faster. Map-based sets also have a huge advantage of being easily readable when printed in the console. Given the sets data structure is opaque it should be possible to just switch implementations and make everybody's code faster without changing a line of code. The only issue might be incorrect uses of the sets module that assume some particular shape for the structure - I'm not sure how widespread that is.

Another possibility would be to implement a new map_sets module, but I think 4 set implementation is already too many - no need to add another. There's also no need to write a new implementation since there's already a well tested, map-based set implementation available in OTP in the cerl_sets module in the compiler application [2].

A similar consideration could be made for the dict module, but I haven't looked closely at the performance differences there - I would expect something similar, though, since sets and dict use a similar implementation.


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

It is best to avoid modifying dict and sets, due to unintended consequences.  One I can think of is the impact on memory allocation when a dict or sets is sent through distributed Erlang, since the memory allocation for a map larger than 32 elements (MAP_SMALL_MAP_LIMIT) gets an allocation from HASHMAP_ESTIMATED_HEAP_SIZE(SIZE) == (SIZE*3 + (2*SIZE/5)*2) which can severely exaggerate the size of the map (details in erts/emulator/beam/erl_map.h).  There may be other impacts also, but that is unclear.  Keeping maps and dict/sets separate should be safest to avoid unintended changes in legacy source code.

Best Regards,
Michael


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Michael Truog
In reply to this post by Jesper Louis Andersen-2
On 04/25/2017 06:17 AM, Jesper Louis Andersen wrote:
On Tue, Apr 25, 2017 at 1:17 PM Attila Rajmund Nohl <[hidden email]> wrote:
It would affect everybody who saves the opaque data using an earlier
OTP version, then reads it in newer OTP version (e.g. after upgrade).
Or those who run two nodes on different OTP versions (e.g. during an
upgrade).

I like to think of dict and sets to be "opaque" data structures. If you are relying on their internal representation, you will run into trouble. So changing the representation going forward should definitely be possible.

This leaves backwards compatibility. If the maps-optimized module can read and dynamically change old sets implementations into the new format on the fly, it may be possible to gradually replace the old representations with the new.

Some things to look out for in that process is data-at-rest, stored years ago in a database. At some point you would have to reprocess such data, or supply a conversion module which can handle these old formats.

I also think that any system must provide some measure of pushing things forward. That is, each major release of Erlang could contain a limited set of things you now have to do differently, with a clear upgrade path from earlier versions. As long as the set is limited, we can probably handle the rewrites needed. If you value backwards compatibility for forever, you run the risk of getting stale, never upgrading anything.

As for the increased memory copy pressure: I think this should be fixed in the context of "maps" and not be part of the argument as to why one would keep the old sets representation.
The memory allocation related to maps is likely not something to solve, and should be related to its data structure (HAMT (Hash-Array Mapped Trie)).  Keeping things stable is important and there should never be changes that sacrifice stability for a passing whim.  It is easy to add new modules for maps used for sets and it is important to avoid adversely affecting legacy source code.  That doesn't block future improvements, it only makes sure they are focused on being improvements.

Best Regards,
Michael

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Richard A. O'Keefe-2
In reply to this post by Michał Muskała
Michael Muskala suggests *replacing* 'sets' (and maybe 'dict')
with something based on maps, and offers numbers to show that
*using* something based on maps would be faster.

Primum non nocere.

Yes, it would be good to have better set and dict modules.
But they should be *new* modules.
The old modules can be deprecated, that's fine,
and even eventually removed (so that people have to pull
them out of old distributions if they want to keep using
them).
But just up and changing the representation is likely to
break programs with persistent data.



_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Tristan Sloughter-4
> But they should be *new* modules.
> The old modules can be deprecated, that's fine,
> and even eventually removed (so that people have to pull
> them out of old distributions if they want to keep using
> them).

Time for sets2 and dict2 :). We've already got pg2, phash2, rebar3...
Putting numbers at the end of functions/modules/programs seems to be a
pattern in Erlang.

I'd definitely want it deprecated and removed eventually if it were
determined the memory size tradeoff is worth it for the speed
improvements.

Personally I'd prefer to just get a map when my current code upgraded
OTP versions, but understand that especially since sets and dicts aren't
types, so binary_to_term can't do the conversion (aside from checking
the structure of the tuple it decodes but that is scary), this is asking
for trouble.

Tristan
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Changing the representation of sets

Andreas Schultz
In reply to this post by Attila Rajmund Nohl
Hi,

----- On Apr 25, 2017, at 1:17 PM, Attila Rajmund Nohl [hidden email] wrote:

[...]

> It would affect everybody who saves the opaque data using an earlier
> OTP version, then reads it in newer OTP version (e.g. after upgrade).

That's easily solved by having the sets module accept the old representation
and convert it on the fly.

> Or those who run two nodes on different OTP versions (e.g. during an
> upgrade).

That is a different beast. One to solve it would be to not automatically
convert between old and new representation. Keep the old structure (and
code) and add a parameter to all function converting to sets to choose
between the old and new internal representation.

We could also have some global setting telling the sets to module to
auto-convert or not (defaulting to "do not convert).

Andreas
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions