term_to_binary and large data structures

Re: term_to_binary and large data structures

Levi Aul
It should be possible to expose the distribution protocol's atom cache for use by external `term_to_binary/1` users.

The atom cache is, itself, a kind of streaming dictionary compression: it replaces a protocol or format that has a single outermost message type (a serialized term) with one that has two outermost message types — a serialized term that may contain dictionary references, or an update to the dictionary.

The dictionary, in this case, contains whole terms (atoms) only. This is essentially the same as what HPACK does for HTTP message streams. HPACK has a very simple API (see https://github.com/twitter/hpack), and the Erlang equivalent could be even simpler.
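To make the idea concrete, here is a minimal sketch (in Python, for illustration only — names and the wire shape are invented, not the actual atom cache or HPACK encoding): the first time a value appears it is sent as a "define" entry that grows a shared dictionary; every later occurrence becomes a small index reference.

```python
def encode_stream(values, table=None):
    """Encode a sequence of values against a shared, growing dictionary."""
    table = {} if table is None else table
    out = []
    for v in values:
        if v in table:
            out.append(('ref', table[v]))      # small reference to the table
        else:
            table[v] = len(table)              # add to the dictionary...
            out.append(('def', table[v], v))   # ...and send the full value once
    return out, table

def decode_stream(encoded, table=None):
    """Decode, rebuilding the same dictionary on the receiving side."""
    table = [] if table is None else table
    out = []
    for entry in encoded:
        if entry[0] == 'def':
            table.append(entry[2])
            out.append(entry[2])
        else:
            out.append(table[entry[1]])
    return out, table

msgs = ['error', 'ok', 'error', 'error', 'ok']
enc, _ = encode_stream(msgs)
dec, _ = decode_stream(enc)
assert dec == msgs
```

Both sides build the same table purely from the traffic, so no separate negotiation is needed — which is exactly what makes this pattern attractive for long-lived connections.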

I'll write up a proposal for this.

On Wed, Jul 4, 2018 at 5:56 AM, Aaron Seigo <[hidden email]> wrote:
On 2018-07-04 13:23, Michał Muskała wrote:

I also believe the current format for maps, which is key1, value1, key2, value2, ..., is
not that great for compression. Often, you'd have maps with exactly the same keys
(especially in Elixir with structs), and there a pattern of key1, key2, ..., value1,
value2, ... should be much better (since the entire keys structure could be compressed
between similar maps).

I can confirm that this is an accurate observation. While not done in Packer, there are notes about this in Packer's code, the result of some experiments in this direction. For maps, and *especially* structs in Elixir, this can indeed be a huge win for some messages.
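The effect is easy to measure. A hypothetical sketch (Python, with zlib standing in for whatever compressor the transport uses; the key names are made up): serialize many maps that share the same keys in both layouts and compare the compressed sizes. With the grouped layout, the entire key block repeats verbatim between maps, giving the compressor one long contiguous match instead of many short ones interrupted by varying values.

```python
import zlib

keys = ['user_id', 'email', 'inserted_at', 'updated_at']
maps = [{k: f'{k}-{i}' for k in keys} for i in range(1000)]

# Layout 1: key1, value1, key2, value2, ... (current map format)
interleaved = b''.join(
    b''.join(k.encode() + m[k].encode() for k in keys) for m in maps)

# Layout 2: key1, key2, ..., value1, value2, ... (proposed grouping)
grouped = b''.join(
    b''.join(k.encode() for k in keys) +
    b''.join(m[k].encode() for k in keys) for m in maps)

print(len(zlib.compress(interleaved)), len(zlib.compress(grouped)))
```

How large the win is depends on the key/value size ratio and the compressor's window, so numbers from a toy benchmark like this should be taken as directional only.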

Even further afield: what would be a real win, but much harder to accomplish, would be streaming compression. There are protocols (e.g. IMAP) which can offload compression of common patterns between messages to entries in the compression look-up tables. The compression is applied to the entire network stream for the life of the connection, and all data that goes through it is compressed in a single stream. So when a message contains the same byte sequence as a previous message, the compressor turns it into a reference to an already existing entry in the look-up table.

The hard(er) part for BEAM distribution would be managing the size of the look-up table, as these connections are meant to be both long-lived and not consume infinite resources ;) So unlike (relatively) short-lived and highly repetitive IMAP connections, this would probably require something custom-made for the task, keeping a cache of the most-used terms (with all that comes with that, including cache invalidation).
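A minimal sketch of the connection-lifetime streaming idea, in Python with zlib (this mirrors how IMAP's COMPRESS=DEFLATE works; it is an illustration, not the distribution protocol): one compressor/decompressor pair lives for the whole connection, so byte sequences repeated across messages become back-references into the shared history window rather than being re-sent.

```python
import zlib

comp = zlib.compressobj()
decomp = zlib.decompressobj()

def send(payload: bytes) -> bytes:
    # Z_SYNC_FLUSH emits a flush point so the message is decodable
    # immediately, while the compressor keeps its history for later messages.
    return comp.compress(payload) + comp.flush(zlib.Z_SYNC_FLUSH)

msg = b'{"event":"metrics","node":"app@host1","cpu":0.42}'
first = send(msg)
second = send(msg)  # same bytes again: mostly references into history

assert decomp.decompress(first) + decomp.decompress(second) == msg + msg
print(len(first), len(second))
```

The second transmission is far smaller than the first, because the whole payload collapses into a back-reference. The resource problem described above shows up here as the compressor's history window: zlib caps it at 32 KB, whereas a distribution-grade scheme would need an explicit eviction policy for its term cache.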

Compared to just working on the term serialization, that feels a bit like rocket science at the moment. But getting maps in the same message more efficiently packed is definitely doable :)

erlang-questions mailing list
[hidden email]