term_to_binary and large data structures

term_to_binary and large data structures

Aaron Seigo
Hello! :)

I have a distributed (in the Erlang sense) application which often
produces moderately-sized maps (10k+ entries with lots of tuples in the
mix) which in the past have given inter-node message passing serious
problems: the vm would lock for a long while, use several GB of RAM, and
usually eventually give up. When it didn't outright crash, it would
produce message sizes too big to send between nodes, and/or the
heartbeat messages between nodes would time out resulting in breakage.
Running the same terms through `term_to_binary` produces similar
results.

The good news is that in OTP 21.0 things are quite a bit better:
serialization of the maps goes a lot quicker, memory usage is now only
~500MB per encoding for terms which would quickly balloon in the
multiple GB's, ... so there is progress and that is really fantastic.

Is 21.0 using something other than `term_to_binary` now for inter-node
messages?

Is it still using the "standard" external term format?

Where in the OTP source tree can one find the relevant code?

Cheers!

--
Aaron Seigo
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: term_to_binary and large data structures

zxq9-2
On Wednesday, 27 June 2018 at 16:19:14 JST, Aaron Seigo wrote:

> Hello! :)
>
> I have a distributed (in the Erlang sense) application which often
> produces moderately-sized maps (10k+ entries with lots of tuples in the
> mix) which in the past have given inter-node message passing serious
> problems: the vm would lock for a long while, use several GB of RAM, and
> usually eventually give up. When it didn't outright crash, it would
> produce message sizes too big to send between nodes, and/or the
> heartbeat messages between nodes would time out resulting in breakage.
> Running the same terms through `term_to_binary` produces similar
> results.

If you're using disterl, keep in mind that inter-node heartbeats operate
on the same channel as inter-node messages.

If you send huge messages you are killing your heartbeats. That doesn't
play out very well.

The *correct* solution would be to make disterl operate over SCTP, but in
reality SCTP isn't supported well enough cross-platform to make for a
universal option (thanks, Microsoft).

Instead, consider opening a separate TCP connection between nodes for
transfer of large data. That way heartbeats and small inter-node messages
don't interfere with your huge bulk message transfers.

(Yes, this is totally the kind of thing that there should be a library
for, but it isn't something there is a lot of time for unpaid work on.
That sucks, but there it is.)

-Craig

Re: term_to_binary and large data structures

Jesper Louis Andersen-2
In reply to this post by Aaron Seigo
The map() type now has iterators, so you can gradually iterate over the map rather than having to convert it all at once. Maybe that is what is helping you.
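
A minimal sketch of that iterator API (`maps:iterator/1` and `maps:next/1`, available since OTP 21):

```erlang
%% Fold over a map one entry at a time instead of materializing it all at once.
sum_values(Map) ->
    sum_values_loop(maps:next(maps:iterator(Map)), 0).

sum_values_loop(none, Acc) ->
    Acc;
sum_values_loop({_Key, Value, NextIter}, Acc) ->
    sum_values_loop(maps:next(NextIter), Acc + Value).
```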

However, I'd strongly recommend you start building up a scheme in which you chunk the large messages into smaller messages with some kind of continuation token. Large messages are bound to create trouble at some point. You can also toy with the idea of moving the code to the data rather than data to the code.
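
A chunking scheme of that shape might look like the following sketch; the function name, chunk size, and message forms are all hypothetical, not an existing API:

```erlang
-define(CHUNK_SIZE, 1000).

%% Send a large list of {Key, Value} pairs to Dest in bounded pieces,
%% using a reference as the continuation token tying the chunks together.
send_chunked(Dest, Pairs, Ref) when length(Pairs) =< ?CHUNK_SIZE ->
    Dest ! {done, Ref, Pairs};
send_chunked(Dest, Pairs, Ref) ->
    {Chunk, Rest} = lists:split(?CHUNK_SIZE, Pairs),
    Dest ! {chunk, Ref, Chunk},
    send_chunked(Dest, Rest, Ref).
```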

On Wed, Jun 27, 2018 at 4:19 PM Aaron Seigo <[hidden email]> wrote:


--
J.


Re: term_to_binary and large data structures

Aaron Seigo
On 2018-06-27 17:05, Jesper Louis Andersen wrote:
> The map() type now has iterators, so you can gradually iterate over the
> map rather
> than having to convert it all at once. Maybe that is what is helping
> you.

That could well be it.

> However, I'd strongly recommend you start building up a scheme in which
> you chunk the
> large messages into smaller messages with some kind of continuation
> token.

We already do. This does not, however, resolve the real issue which is
bandwidth usage. Chunking the messages just makes smaller bits of bloat.
The total bloat is exactly the same, however, and easily inundates Gbit
and even 10Gbit networking. Except now we have the _added overhead_ of
more messages.

It's merely a way to shuffle forward, not a path to anything scalable.

> Large messages are bound to create trouble at some point.

Yes, if unbounded, I would agree. However, that is not our case.

We have maps with 10k keys that strain this system and easily saturate
our network. This is not "big" by any modern definition. As a
demonstration of this to ourselves, I wrote an Elixir library that
serializes terms to a more space efficient format. Where
`term_to_binary` creates 500MB monsters, this library conveniently
creates a 1.5MB binary out of the exact same data.

In fact, for nearly any term you throw at it, this pretty simple
algorithm produces smaller serialized data. You can see the format
employed here:

      https://github.com/aseigo/packer/blob/develop/FORMAT.md

That it routinely produces results anywhere from 33% to 99% (!!)
smaller just shows how problematic the current external term format is.
Unfortunately, this is "only" an Elixir implementation and so is not
very fast at this point. The point of the exercise was to see what a
reasonable term serializer could produce, specifically to see if there
was any improvement to be had.

Apparently there is quite a bit. The external term format is widely
used, so improvements in it could be far reaching.

How difficult would it be to change the external term format, based on
e.g. the versioning in the distribution header? Would it be possible to
make term serialization pluggable, as more and more of the rest of the
distribution framework in the BEAM has become in v21?

> You can also toy with the
> idea of moving the code to the data rather than data to the code.

Our goal is to distribute computation, so that would be
counter-productive.

--
Aaron

Re: term_to_binary and large data structures

Aaron Seigo
In reply to this post by zxq9-2
On 2018-06-27 16:39, [hidden email] wrote:
> On Wednesday, 27 June 2018 at 16:19:14 JST, Aaron Seigo wrote:
> If you send huge messages you are killing your heartbeats. That doesn't
> play out very well.

Yes, we've noticed ;)

> Instead, consider opening a separate TCP connection between nodes for
> transfer of large data. That way heartbeats and small inter-node
> messages
> don't interfere with your huge bulk message transfers.

That would be fine, and quite doable ... but it means everyone having
similar issues needs to do the same. Why are we using Erlang, again?
Yes, for ease of use of things like concurrency and distribution. When
those things break, the reason to use Erlang evaporates. It would be
better to address the issues (even if it takes a while and some effort
to do so!) than to work around them.

That said .. the recent work on top of 21 to allow out-of-order sending
of messages, including during chunked message transmission, to avoid
head-of-line blocking is really promising and should, once finalized and
upstreamed, resolve this particular issue.

> (Yes, this is totally the kind of thing that there should be a library
> for, but it isn't something there is a lot of time for unpaid work on.
> That sucks, but there it is.)

IMHO it should be part of the distribution system itself, one way or the
other, as using a library to work *around* distribution (and which
itself would need to follow distribution to be transparent) is a little
daft.

There are enough of us working on paid time these days to make it
happen, as long as we know what needs doing and what is acceptable. :)

--
Aaron

Re: term_to_binary and large data structures

Chandru-4
In reply to this post by zxq9-2
On 27 June 2018 at 15:39, <[hidden email]> wrote:


There are a few libraries which offer this functionality. Here are two I know of.



Chandru




Re: term_to_binary and large data structures

Tristan Sloughter-4
In reply to this post by Aaron Seigo
> That said .. the recent work on top of 21 to allow out-of-order sending
> of messages, including during chunked message transmission, to avoid
> head-of-line blocking is really promising and should, once finalized and
> upstreamed, resolve this particular issue.

Yes, this is going to be great. And with the changes internally for 21 I was able to get my SCTP experiment working, https://github.com/tsloughter/sctp_dist 

Re: term_to_binary and large data structures

zxq9-2
On Wednesday, 27 June 2018 at 15:57:13 JST, Tristan Sloughter wrote:
> > That said .. the recent work on top of 21 to allow out-of-order sending
> > of messages, including during chunked message transmission, to avoid
> > head-of-line blocking is really promising and should, once finalized and
> > upstreamed, resolve this particular issue.
>
> Yes, this is going to be great. And with the changes internally for 21 I
> was able to get my SCTP experiment working,
> https://github.com/tsloughter/sctp_dist 

It's like... the future!

-Craig

Re: term_to_binary and large data structures

Fred Hebert-2
In reply to this post by Aaron Seigo
On 06/27, Aaron Seigo wrote:
>We have maps with 10k keys that strain this system and easily saturate
>our network. This is not "big" by any modern definition. As a
>demonstration of this to ourselves, I wrote an Elixir library that
>serializes terms to a more space efficient format. Where
>`term_to_binary` creates 500MB monsters, this library conveniently
>creates a 1.5MB binary out of the exact same data.
>

Have you tried comparing when `term_to_binary(Term, [{compressed, 9}])'?  
If you can pack 500MB of data down to 1.5 MB, chances are that
compression could do some good things on your end.
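
For instance (a shell sketch; the exact sizes will depend on your terms):

```erlang
1> Term = maps:from_list([{N, {N, N * N}} || N <- lists:seq(1, 10000)]).
2> byte_size(term_to_binary(Term)).
3> byte_size(term_to_binary(Term, [{compressed, 9}])).
```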

Regards,
Fred.

Re: term_to_binary and large data structures

zxq9-2
On Wednesday, 27 June 2018 at 19:14:10 JST, Fred Hebert wrote:

> On 06/27, Aaron Seigo wrote:
> >We have maps with 10k keys that strain this system and easily saturate
> >our network. This is not "big" by any modern definition. As a
> >demonstration of this to ourselves, I wrote an Elixir library that
> >serializes terms to a more space efficient format. Where
> >`term_to_binary` creates 500MB monsters, this library conveniently
> >creates a 1.5MB binary out of the exact same data.
> >
>
> Have you tried comparing when `term_to_binary(Term, [{compressed, 9}])'?  
> If you can pack 500MB of data down to 1.5 MB, chances are that
> compression could do some good things on your end.

With ratios like that it is also highly likely that the bulk of the data
is not actually information...

-Craig

Re: term_to_binary and large data structures

Michael Truog
In reply to this post by Aaron Seigo
On 06/27/2018 07:19 AM, Aaron Seigo wrote:
> I have a distributed (in the Erlang sense) application which often produces moderately-sized maps (10k+ entries with lots of tuples in the mix) which in the past have given inter-node message passing serious problems: the vm would lock for a long while, use several GB of RAM, and usually eventually give up. When it didn't outright crash, it would produce message sizes too big to send between nodes, and/or the heartbeat messages between nodes would time out resulting in breakage. Running the same terms through `term_to_binary` produces similar results.
>
> The good news is that in OTP 21.0 things are quite a bit better: serialization of the maps goes a lot quicker, memory usage is now only ~500MB per encoding for terms which would quickly balloon in the multiple GB's, ... so there is progress and that is really fantastic.
>

Part of what you may be seeing is the amount of memory allocated for the receiver of a distributed Erlang message that contains a large Erlang map because of the need to over-estimate the total size of the map at https://github.com/erlang/otp/blob/f3790140d0e73f257c78d67de894b606ef53a8e5/erts/emulator/beam/erl_map.h#L196-L197 (not sure if other related logic changed recently, others may want to comment on that, if that is the case).

Your messages sound big enough that you may want to consider switching to a less-dynamic binary format, if that is possible with your usage of the data, to minimize the potential memory consumption.  Get/Put operations on a binary are very slow (e.g., at https://github.com/okeuday/blookup) though it may help you deal with the memory usage in a simpler way.

Best Regards,
Michael



Re: term_to_binary and large data structures

Aaron Seigo
In reply to this post by Fred Hebert-2
On 2018-06-28 01:14, Fred Hebert wrote:

> On 06/27, Aaron Seigo wrote:
>> We have maps with 10k keys that strain this system and easily saturate
>> our network. This is not "big" by any modern definition. As a
>> demonstration of this to ourselves, I wrote an Elixir library that
>> serializes terms to a more space efficient format. Where
>> `term_to_binary` creates 500MB monsters, this library conveniently
>> creates a 1.5MB binary out of the exact same data.
>>
>
> Have you tried comparing when `term_to_binary(Term, [{compressed,
> 9}])'?  If you can pack 500MB of data down to 1.5 MB, chances are that
> compression could do some good things on your end.

Yes, and it certainly helps but it is still larger than one would hope
for (and larger than what that POC produces), but most importantly this
only is meaningful when we control the call to `term_to_binary`. When it
is hidden behind code in OTP or a library, or an equivalent function is
generating an external term format binary, we don't get to use this
trick.

Which also brings us to the fact that the compression being used is
still zlib, while there are much better options out there. That POC
implementation uses zstd which is both faster and produces smaller
binaries than zlib.

--
Aaron

Re: term_to_binary and large data structures

Aaron Seigo
In reply to this post by Michael Truog
On 2018-06-28 02:45, Michael Truog wrote:

> On 06/27/2018 07:19 AM, Aaron Seigo wrote:
>> I have a distributed (in the Erlang sense) application which often
>> produces moderately-sized maps (10k+ entries with lots of tuples in
>> the mix) which in the past have given inter-node message passing
>> serious problems: the vm would lock for a long while, use several GB
>> of RAM, and usually eventually give up. When it didn't outright crash,
>> it would produce message sizes too big to send between nodes, and/or
>> the heartbeat messages between nodes would time out resulting in
>> breakage. Running the same terms through `term_to_binary` produces
>> similar results.
>>
>> The good news is that in OTP 21.0 things are quite a bit better:
>> serialization of the maps goes a lot quicker, memory usage is now only
>> ~500MB per encoding for terms which would quickly balloon in the
>> multiple GB's, ... so there is progress and that is really fantastic.
>>
>
> Part of what you may be seeing is the amount of memory allocated for
> the receiver of a distributed Erlang message that contains a large
> Erlang map because of the need to over-estimate the total size of the
> map at

Those numbers were all from the sending side; the maps don't seem to be
an issue on deserialization.

> messages sound big enough that you may want to consider switching
> to a less-dynamic binary format,

Everything is possible ;) but not everything is palatable. The maps are
generated in part by NIFs so stepping outside the standard set of data
structures becomes more difficult, and for our use cases maps are the
"right" data structure not just to represent the data but more
importantly to work with it.

I'm not a fan of working around a problem when the cause and location of
it is easily noted. It would be much nicer to improve the serialization
of data for messages, not only for our needs, but since it would
positively impact every user of the BEAM for distribution.

Thanks for the pointer to blookup, though; neat approach. We don't
really have the issue of usage between processes, though, as much as we
do between nodes. So reference counting can't really help us :)

--
Aaron

Re: term_to_binary and large data structures

Aaron Seigo
In reply to this post by Aaron Seigo
On 2018-06-28 07:03, Dmitry Belyaev wrote:
> So it's not so bad as it's stated in the file, only 33 times worse than
> the advertised
> format.

My bad; this came from a run of performance tests where the size of the
map is increased incrementally. It is missing a zero in one of the
lines; will fix.

> However after applying [compressed: 9] suggested by Fred Hebert, I see:
>
> iex(6)> :erlang.term_to_binary(tuple_map, compressed: 9) |> byte_size()
> |> (fn x -> x / 1024 / 1024 end).()
> 0.38570117950439453

There are three problems with this:

a) that data does get decompressed at some point. It doesn't magically
go away. It does, however, help with network usage.

b) we trade time (to compress) for space (fewer bytes to transmit), when
that space isn't needed in the first place. The time required to
compress, esp. relative to the serialization time, is decidedly
non-trivial: it takes several times as long to compress with zlib as it
does to serialize. This could be improved by moving to a more modern
compression algorithm, though the cost is always non-zero, of course. In
our tests, it definitely paid to compress the actual data in the map,
but there was very little need to compress the structural metadata when
encoded efficiently.

c) we don't always control the call to term_to_binary, or the equivalent
external term generators, and so don't have access to compression, e.g.
on distribution messages

I suppose we could propose using compression on (larger) distribution
messages, which would help with the network saturation and would be a
better stop-gap than nothing, but it still leaves us with (a) and (b)
above, and (c) everywhere else.

--
Aaron

Re: term_to_binary and large data structures

Jesper Louis Andersen-2
In reply to this post by Aaron Seigo
On Wed, Jun 27, 2018 at 11:24 PM Aaron Seigo <[hidden email]> wrote:
In fact, for nearly any term you throw at it, this pretty simple
algorithm produces smaller serialized data. You can see the format
employed here:

      https://github.com/aseigo/packer/blob/develop/FORMAT.md


​That is also a format with different properties. The external term format doubles as an on-disk format where you might have the need to be robust against a few bad blocks. Schema-based formats tend to be worse than bloaty-prefix-encoded formats here. It probably hurts Elixir more since the map() type is the underlying standard type for many things whereas a record in Erlang is a tuple with some compile time expansion on top. In short, keys are not repeated in this format.

You might want to look into Joe Armstrong's UBF stack in which you define the schema as a small stack engine and then interpret that engine to produce data. It serves as a hybrid in which you have a schema, but since the stack engine supports a duplicate-instruction, you can repeat keys in maps if they are the same and so on. In turn, you still have a prefix-like encoding, but it compresses far better for schemes where you have many repeated keys.

If you want to have a header-schema, it is probably worth it to just take Protobuf3 and see how well that format handles the data. It has an encoding scheme, varints and ZigZag encoding, which represents integers in a way that makes small integers small in the data stream and also compresses well. So for real-world data, this encoding tends to win.
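
As a sketch of the ZigZag idea: it interleaves negative and positive integers so that values of small magnitude map to small unsigned numbers, which a varint then stores in few bytes (Protobuf fixes this at 64 bits; the version below works on Erlang's arbitrary-precision integers):

```erlang
%% zigzag(0) -> 0, zigzag(-1) -> 1, zigzag(1) -> 2, zigzag(-2) -> 3, ...
zigzag(N) when N >= 0 -> N bsl 1;
zigzag(N)             -> (bnot N bsl 1) bor 1.
```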

​Ephemeral data transfer between nodes could benefit from having a new format, or an update which packs maps better.

emulator/beam/external.c:BIF_RETTYPE term_to_binary_1(BIF_ALIST_1)

​is the place you want to start looking. Be cautious of the following caveats:

* You must write your code so it can be preempted (see the trap variants)
* The distribution protocol is a bit different and has an atom-cache for common atoms. I'm not entirely sure this is the entry-point for it
​* We need backwards compatibility. In many cases even small changes to this format has proven to be riddled with loss of compatibility.
* We might want to rethink ordering properties of the produced binary. I know this has been a want historically, but I'm not sure we should grant that wish :P
* For distribution: More plugability would be really cool to have.

Finally, as for the goal of distributing computation: distribute data is still my advice. If data is distributed, computation distributes trivially. Moving data around is going to be a major bottleneck going forward, and the more data you amass, the more you are going to pay moving that data around. Better distribution formats just shaves a constant factor, so you eventually hit the same bottleneck in the long run. The other problem is that any kind of shared/copied data requires locking or coordination. Anything involving those parallelizes badly.

--
J.


Re: term_to_binary and large data structures

Aaron Seigo
On 2018-06-28 12:53, Jesper Louis Andersen wrote:

> On Wed, Jun 27, 2018 at 11:24 PM Aaron Seigo <[hidden email]>
> wrote:
>
>> In fact, for nearly any term you throw at it, this pretty simple
>> algorithm produces smaller serialized data. You can see the format
>> employed here:
>>
>> https://github.com/aseigo/packer/blob/develop/FORMAT.md
>
> ​That is also a format with different properties. The external term
> format doubles as
> an on-disk format where you might have the need to be robust against a
> few bad blocks.

There are ways to achieve that with schema-based formats, as well, of
course... but, yes, as it currently stands the format above is not great
for a disk format, though I would argue that external term format (ETF?)
:) is also rather lacking for use on-disk unless one adds to the base
format.

> Schema-based formats tend to be worse than bloaty-prefix-encoded
> formats here. It
> probably hurts Elixir more since the map() type is the underlying
> standard type for
> many things whereas a record in Erlang is a tuple with some compile
> time expansion on
> top. In short, keys are not repeated in this format.

Yes, records have some benefits indeed. However, the problem is not just
maps. In fact, tuples are a big problem themselves. The ETF uses one
byte for the type code (tag 104 for a tuple with 0-255 elements; tag
105 for larger tuples) and either a 1-byte or 4-byte integer containing
the arity. That is 1 byte more than necessary for small tuples, and 4
extra bytes for larger tuples .. which is made more odd by the fact that
tuples can only have 2^24 elements maximum; reading whole words is
theoretically faster, so perhaps that was the motivation to tack on an
extra byte ...

Still, in a common case it takes 2 bytes to encode an empty (!) tuple,
and 2 bytes to encode a small (the common case ime..) tuple. So a tuple
like {1,2} requires 6 bytes to encode, {1,2,3} takes 8 ... with the
compressed metadata approach this becomes 6 bytes for either of those
tuples. Stuff a list with a bunch of tuples and the overhead adds up
amazingly quickly: [{1, 2}, {3, 4}, {5, 6}] takes 25 bytes in the ETF,
and 18 with the packer library. add one more similar tuple and we end up
with 31 (!) bytes for ETF, and 20 with packer.

There are a wide range of inefficiencies in the ETF, so hopefully we
don't get sidetracked by maps alone.

(In our use case, it was tuples killing us, not maps!)
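
The byte counts above can be checked directly in the shell; note that term_to_binary/1 also prepends the version tag 131, which accounts for the extra byte relative to the per-tuple counts:

```erlang
1> term_to_binary({1,2}).
<<131,104,2,97,1,97,2>>   % tag 104 (small tuple), arity 2, then 97 (small int) per element
2> byte_size(term_to_binary({1,2,3})).
9
3> byte_size(term_to_binary([{1,2},{3,4},{5,6}])).
25
```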

> You might want to look into Joe Armstrong's UBF stack in which you
> define the schema
> as a small stack engine and then interpret that engine to produce data.
> It serves as a
> hybrid in which you have a schema, but since the stack engine supports
> a duplicate-
> instruction, you can repeat keys in maps if they are the same and so
> on. In turn, you

Repeating data is easily, and usually more efficiently, compressed using
a traditional compression tool. Packer uses zstd and it does a lovely
job of this. It is really the metadata encoding ...

> If you want to have a header-schema, it is probably worth it to just
> take Protobuf3
> and see how well that format handles the data.

There are 2 major problems with using this sort of solution:

  * none (that we are aware of) supports Erlang's range of data types;
tuples in particular. So you are back to being "on your own". I would
love to have had an OTS solution :)

  * we can not (for reasons of practicality) always know the type of data
we will want to encode for a message, and so it is not possible to
create the sort of schema definitions that these systems want (and which
is an important source of their efficiency and safety).

So while they 100% make sense in the right use cases, they are not a
good fit for a general purpose erlang term serializer.

> It has an encoding scheme, varints and ZigZag encoding which represents
> integers in a way which makes small integers small in the data stream,
> and also compresses well. So for real-world data, this encoding tends
> to win.

Packer does the small integers thing as well :)

> ​Ephemeral data transfer between nodes could benefit from having a new
> format, or an
> update which packs maps better.

... and tuples ;)

Given that Packer does an equivalent or better job in terms of resulting
number of bytes used in nearly every single term (and most of the
remaining exceptions are addressable with simple additions, such as
having a type code for empty list, list of small number, ala ETF), it is
clear it is not just maps.

> emulator/beam/external.c:BIF_RETTYPE term_to_binary_1(BIF_ALIST_1)

cool; will look .. but the reason I wrote to the list first was in hopes
to find out if there was appetite for such an improvement. IOW: are the
odds of a quality implementation being upstreamed high enough to warrant
our effort there :)

> * You must write your code so it can be preempted (see the trap
> variants)

Yep ...

> * The distribution protocol is a bit different and has an atom-cache
> for common atoms.

Yes, this is one place the distribution protocol has a leg up on packer
due to access to the atom cache.

> I'm not entirely sure this is the entry-point for it

Even if not, messages for distribution should utilize that cache?

> ​* We need backwards compatibility. In many cases even small changes to
> this format has
> proven to be riddled with loss of compatibility.

My hopes go as follows:

Use versioning in the distribution header to indicate that a newer
serialization format is in use, and fall back to the existing format
when it is not indicated. As there would be no change in the data types
expressible, this would only alter the representation of the data on the
wire.

> * We might want to rethink ordering properties of the produced binary.
> I know this has
> been a want historically, but I'm not sure we should grant that wish :P

Do you have pointers to discussions on this so I can understand what you
mean by "ordering properties" better? (Bug report, mailing list thread,
etc.?)

> * For distribution: More plugability would be really cool to have.

Agreed :)

> Finally, as for the goal of distributing computation: distribute data
> is still my
> advice. If data is distributed, computation distributes trivially.

Unfortunately for us, our compute is expensive relative to the cost of
the data. A small amount of data ends up creating a large amount of
computation, which produces other not-very-big sets of data, which in
turn kicks off even more computation, and so on. (I don't consider 100k
pairs to be particularly large, especially when they can be serialized
to <2MB, just as I don't think anyone seriously considers a 1GB SQL
database to be large; but maybe that's just my perspective :) The cost
of computation is such that we need many machines working together on
shared problems.

Our problem occupies a space somewhere between more typical Erlang use
cases (messaging, e.g.) and scientific computing, as it is neither but
has some similar properties to both.

As a result, for us it isn't enough to keep data local and move
computation to the data.

> Better distribution formats just shave a constant factor

Indeed; however, currently that is a MASSIVE constant. With the data we
have, it means the difference between 2.5 mps (messages per second) and
833 mps on a 10GbE network. So the gap between the current ceiling and
the new ceiling would be huge. Long term you are absolutely correct,
though, when you say:

> so you eventually hit the same bottleneck in the long run. The

Pre-computed sharding of the data space is an approach we use to avoid,
or at least delay well beyond our requirements, the onset of that
bottleneck ...

> problem is that any kind of shared/copied data requires locking or
> coordination. Anything involving those parallelizes badly.

This is not a property of the data in this application, other than the
job queues, and that's only because we have those distributed for the
purposes of durability. They also represent ~0% of the run time in the
system, so the overhead there is ok.

Thanks for the insights there, though, as they are .. well .. insightful
;) and bang-on for most applications.

Cheers,

--
Aaron
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: term_to_binary and large data structures

Benoit Chesneau-2
In reply to this post by Chandru-4
also https://gitlab.com/barrel-db/teleport



On Wednesday, June 27, 2018, Chandru <[hidden email]> wrote:
On 27 June 2018 at 15:39, <[hidden email]> wrote:
On 2018年6月27日水曜日 16時19分14秒 JST Aaron Seigo wrote:
> Hello! :)
>
> I have a distributed (in the Erlang sense) application which often
> produces moderately-sized maps (10k+ entries with lots of tuples in the
> mix) which in the past have given inter-node message passing serious
> problems: the vm would lock for a long while, use several GB of RAM, and
> usually eventually give up. When it didn't outright crash, it would
> produce message sizes too big to send between nodes, and/or the
> heartbeat messages between nodes would time out resulting in breakage.
> Running the same terms through `term_to_binary` produces similar
> results.

If you're using disterl, keep in mind that inter-node heartbeats operate
on the same channel as inter-node messages.

If you send huge messages you are killing your heartbeats. That doesn't
play out very well.

The *correct* solution would be to make disterl operate over SCTP, but in
reality SCTP isn't supported well enough cross-platform to make for a
universal option (thanks, Microsoft).

Instead, consider opening a separate TCP connection between nodes for
transfer of large data. That way heartbeats and small inter-node
messages don't interfere with your huge bulk message transfers.
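A minimal sketch of that side-channel idea (Python rather than Erlang so
it fits in a few lines; the framing scheme and helper names are made up
for illustration, not from any existing library): bulk payloads travel
over a dedicated socket as length-prefixed frames, leaving the
distribution channel free for heartbeats and small messages.

```python
import socket
import struct

def send_frame(sock: socket.socket, payload: bytes) -> None:
    # 4-byte big-endian length prefix, then the payload itself.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_frame(sock: socket.socket) -> bytes:
    def recv_exact(n: int) -> bytes:
        buf = bytearray()
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-frame")
            buf += chunk
        return bytes(buf)

    (size,) = struct.unpack(">I", recv_exact(4))
    return recv_exact(size)

# Demo over a local socket pair standing in for the node-to-node link.
a, b = socket.socketpair()
send_frame(a, b"x" * 2000)   # the "bulk" payload
received = recv_frame(b)
```

In a real system each node would listen on a known port for these bulk
connections; the point is only that the heartbeat-carrying channel never
queues behind a multi-megabyte frame.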

(Yes, this is totally the kind of thing that there should be a library
for, but it isn't something there is a lot of time for unpaid work on.
That sucks, but there it is.)


There are a few libraries which offer this functionality. Here are two I know of.



Chandru




--
Sent from my Mobile


Re: term_to_binary and large data structures

t ty
Also: Ethernet bonding.



Re: term_to_binary and large data structures

Michał Muskała
In reply to this post by Aaron Seigo
I also believe the current format for maps, which is key1, value1, key2, value2, ..., is not that great for compression. Often you'd have maps with exactly the same keys (especially in Elixir with structs), and there a pattern of key1, key2, ..., value1, value2, ... should be much better (since the entire key structure could be compressed across similar maps).

Michał.
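The effect is easy to demonstrate with zlib over a batch of same-shaped
maps (a Python sketch; the two byte layouts below are hand-rolled
stand-ins for the term encodings, not the real wire formats):

```python
import zlib

# 200 "structs": identical keys, varying values.
maps = [{"id": i, "name": f"user{i}", "score": i * 7} for i in range(200)]

def interleaved(ms):
    # key1, value1, key2, value2, ... (the current-style layout)
    return b"".join(
        b"".join(str(k).encode() + str(v).encode() for k, v in m.items())
        for m in ms
    )

def grouped(ms):
    # key1, key2, ..., value1, value2, ... (keys form one repeating run)
    return b"".join(
        b"".join(str(k).encode() for k in m)
        + b"".join(str(v).encode() for v in m.values())
        for m in ms
    )

i_size = len(zlib.compress(interleaved(maps)))
g_size = len(zlib.compress(grouped(maps)))
```

Both layouts contain exactly the same bytes; in runs of this sketch the
grouped one typically compresses smaller, because the key block of every
map after the first becomes a single back-reference. Exact numbers
depend on the data and the compressor.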
On 28 Jun 2018, 07:36 +0200, Aaron Seigo <[hidden email]>, wrote:
On 2018-06-28 07:03, Dmitry Belyaev wrote:
So it's not as bad as stated in the file, only 33 times worse than the
advertised format.

My bad; this came from a run of performance tests where the size of the
map is increased incrementally. It is missing a zero in one of the
lines; will fix.

However, after applying the [compressed: 9] option suggested by Fred Hebert, I see:

iex(6)> :erlang.term_to_binary(tuple_map, compressed: 9) |> byte_size()
|> (fn x -> x / 1024 / 1024 end).()
0.38570117950439453

There are three problems with this:

a) that data does get decompressed at some point. It doesn't magically
go away. It does, however, help with network usage.

b) we trade time (to compress) for space (fewer bytes to transmit) when
that space isn't needed in the first place. The time required to
compress, especially relative to the serialization time, is decidedly
non-trivial: it takes several times as long to compress with zlib as it
does to serialize. This could be improved by moving to a more modern
compression algorithm, though the cost is always non-zero, of course. In
our tests it definitely paid to compress the actual data in the map, but
there was very little need to compress the structural metadata when it
was encoded efficiently.
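Point (b) is visible even within a single compressor via its level knob,
which trades CPU time for output size (Python sketch; timings vary by
machine, so they are measured rather than asserted):

```python
import time
import zlib

# Repetitive stand-in for serialized map data.
payload = b"some repetitive map-ish data, " * 5000

def measure(level: int):
    # Return (compressed size, wall-clock seconds) for one level.
    t0 = time.perf_counter()
    packed = zlib.compress(payload, level)
    return len(packed), time.perf_counter() - t0

fast_size, fast_t = measure(1)  # cheap and cheerful
best_size, best_t = measure(9)  # spend CPU chasing bytes
```

The higher level burns more time for (usually) fewer bytes, and neither
buys anything if the bytes didn't need saving in the first place.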

c) we don't always control the call to term_to_binary, or the equivalent
external term generators, and so don't have access to compression, e.g.
on distribution messages.

I suppose we could propose using compression on (larger) distribution
messages, which would help with network saturation and would be a better
stop-gap than nothing, but it still leaves us with (a) and (b) above,
and (c) everywhere else.

--
Aaron

Re: term_to_binary and large data structures

Aaron Seigo
On 2018-07-04 13:23, Michał Muskała wrote:

> I also believe the current format for maps, which is key1, value1,
> key2, value2, ..., is not that great for compression. Often you'd have
> maps with exactly the same keys (especially in Elixir with structs),
> and there a pattern of key1, key2, ..., value1, value2, ... should be
> much better (since the entire key structure could be compressed across
> similar maps).

I can confirm that this is an accurate observation. While this is not
done in Packer, there are notes about it in Packer's code, the result of
some experiments in this direction. For maps, and *especially* structs
in Elixir, this can indeed be a huge win for some messages.

Even farther afield: what would be a real win, but much harder to
accomplish, is streaming compression. There are protocols (e.g. IMAP)
which can offload compression of common patterns between messages to
entries in the compression lookup tables. The compression is applied to
the entire network stream for the life of the connection, and all data
that goes through it is compressed as a single stream. So when a message
contains the same byte sequence as a previous message, the compressor
ends up turning that into a reference to an already-existing entry in a
lookup table.
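That stream-level trick can be sketched with zlib's Z_SYNC_FLUSH, which
hands back all pending output while keeping the compressor's 32 KB
history window alive across messages (a Python sketch of the mechanism,
mirroring how IMAP's COMPRESS=DEFLATE behaves; not anything BEAM does
today):

```python
import zlib

comp = zlib.compressobj()
decomp = zlib.decompressobj()
message = b'{"user":"alice","mailbox":"INBOX","op":"fetch"}' * 20

def on_the_wire(data: bytes) -> bytes:
    # Z_SYNC_FLUSH emits everything buffered so far but keeps the
    # history window, so later messages can be encoded as
    # back-references to byte sequences seen in earlier ones.
    return comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)

first = on_the_wire(message)   # pays full price: empty window
second = on_the_wire(message)  # same bytes again: mostly references
```

In a run of this sketch the second wire chunk is a small fraction of the
first, even though both decode to identical payloads.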

The hard(er) part for BEAM distribution would be managing the size of
the lookup table, as these connections are meant to be both long-lived
and not consume infinite resources ;) So unlike (relatively) short-lived
and highly repetitive IMAP connections, this would probably require
something custom-made for the task, keeping a cache of the most-used
terms (with all that comes with that, including cache invalidation).

Compared to just working on the term serialization, that feels a bit
like rocket science at the moment. But packing maps in the same message
more efficiently is definitely doable :)

--
Aaron