Binary, List and Tuple Inequalities (Paradox?)

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

Binary, List and Tuple Inequalities (Paradox?)

Valentin Micic-6
Hi all, 

Consider the following inequalities:

(tsdb_1_1@macbookv-3)2876> <<0,0,1>> < <<1,0,0>>.
true

As well as:

(tsdb_1_1@macbookv-3)2878> [0,0,1] < [1,0,0].
true

The result (true) makes sense — in both cases operands are of the same size, and the first (leftmost) element value of the left operand is less than  first (again, leftmost) element value of the right operand.

However:

(tsdb_1_1@macbookv-3)2877> <<0,0,1>> < <<1,0>>.   
true

and

(tsdb_1_1@macbookv-3)2879> [0,0,1] < [1,0].  
true

This indicates that the actual length of the operands are not considered, for clearly, three octets cannot be less then two, right?

This becomes even more confusing if you check how tuples are compared given the similar test-case:

(tsdb_1_1@macbookv-3)2886> {0,0,1} < {1,0,0}.  
true
(tsdb_1_1@macbookv-3)2887> {0,0,1} < {1,0}.  
false

Here, the number of elements in the tuple are clearly taken into the consideration.

Then, if one converts all three pairs of operands to its (external) binary representation, e.g. 

Binary:

(tsdb_1_1@macbookv-3)2916> term_to_binary( <<0,0,1>> ).
<<131,109,0,0,0,3,0,0,1>>
(tsdb_1_1@macbookv-3)2917> term_to_binary( <<1,0>> ).  
<<131,109,0,0,0,2,1,0>>

List:

tsdb_1_1@macbookv-3)2918> term_to_binary( [0,0,1] ).  
<<131,107,0,3,0,0,1>>
(tsdb_1_1@macbookv-3)2919> term_to_binary( [1,0] ).  
<<131,107,0,2,1,0>>

Tuple:
(tsdb_1_1@macbookv-3)2920> term_to_binary( {0,0,1} ).
<<131,104,3,97,0,97,0,97,1>>
(tsdb_1_1@macbookv-3)2921> term_to_binary( {1,0} ).  
<<131,104,2,97,1,97,0>>

One could see that the number of “elements” are known and available for comparison (I’ve highlighted this number using bold, yellow font) for all three situations.

And yet, the different approaches between binary and lists on one side, and tuples, on another, appear to be deliberate as binary and list types are following the approach used by C standard library memcmp function, whereas tuples do not, e.g.

int memcmp(const void *s1, const void *s2, size_t n);
In other words, operands are reduced to some common size ’n', and then compared as if they were equal in length. 
I do understand this restriction in C -- if not for ’n’, the argument sizes are not known, and without it the execution may end up in a catastrophic failure (e.g. segment violation, hmm.. which can still happen anyway if ’n’ is inadequate).

So, THE QUESTION:

Why would it be wrong to consider inequalities for binary() data types the same way we do it for tuples — number of “elements” first and  “numeric values” of individual elements — second?

In my view, this is not only not wrong, but it would be more logical, and this is why.

Look at the following comparison:

(tsdb_1_1@macbookv-3)2957> <<0,0,1>> < <<0,0,1,0>>.
true


Well, this make sense, for less-is-less (e.g. three octets are less than four octets) and even if one considers  <<0,0,1>> as a 24-bit integer value, and <<0,0,1,0>> as a 32-bit integer, one may “normalise” the 24-bit integer to its 32-bit equivalent by adding a leading zero, and the comparison expression would still return true:

(tsdb_1_1@macbookv-3)2961> <<0,0,0,1>> < <<0,0,1,0>>.   
true

Therefore, as expected, 1 <  256, and, thus, we may be forgiven if we have the same expectation when we write:

(tsdb_1_1@macbookv-3)2958> <<1>> < <<0,0,1,0>>. 
 
Right? Well, not so, because:

(tsdb_1_1@macbookv-3)2958> <<1>> < <<0,0,1,0>>.    
false

Thus, "less is more", and therefore 1 appears to be greater-than-or-equal-to  256. Somehow.
 
This could never happen if one considered a number of elements (in this case number of octets) before attempting to compare the individual element values.

But wait, the "less-is-more" may easily become its opposite — “more is less”, because:

(tsdb_1_1@macbookv-3)2973> <<32, 97>> < <<97>>.
true

Yes, this approach may help us to sort a list of, say,  surnames alphabetically, so we can have all the “Ms” before the “Ns”, regardless of how long the surname is, but is this actually logical for two arbitrary binaries? Seeing that we actually care about number of elements when we deal with tuples, why not extend the same approach to arbitrary binary values?

Why do we presume that binaries are given as STRINGS when we compare them? When it comes to lists, the situation is even worse, because lists do, but then do not follow the STRING comparison semantic:

(tsdb_1_1@macbookv-3)2991> [32, 97] < [97].  
true

*** But! ***

(tsdb_1_1@macbookv-3)2992> [{32, 97}] < [97].
false

(tsdb_1_1@macbookv-3)2993> [{32, 97}] < [{97}].
false

And just to be clear, I do not expect Erlang to change. Nor do I care about how lists are handled (well, at least not for the moment).

My main concern is semantic behind comparing two arbitrary binary() operands. For one of my projects I’d like to employ tuple-like semantic (.e.g. number of octets first, values later) when comparing two arbitrary binaries and thus avoid “less-is-more” and “more-is-less” paradoxes.

Is there any reason why I shouldn’t? (Okay, it may be a bit slower, but one may argue this approach may be even a bit faster for a long-enough binaries. But what else?)

At any rate, I would appreciate your reasoning on the topic. What resonates better with you?


Thanks in advance.

V/


Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Dmitry Belyaev-3
To me alphabetical ordering of lists (as legacy string representation) and binaries (as modern string representation) is natural, so it is expected that "aa" < "b" < "bb".

Not sure about binaries ordering which first checks the binary size, like "b" < "aa" < "bb" - I personally cannot find a suitable application for such ordering.

And about the lists. Their length is not known as for binaries. It can only be calculated with linear complexity. If you have a list 'aaa..a' of 50 elements and a list 'bbb..b' of 49, you'd have to make 49*2 iterations to determine which one is less. With the current implementation it is just 1 comparison.

On 26 October 2019 1:36:59 am AEDT, Valentin Micic <[hidden email]> wrote:
Hi all, 

Consider the following inequalities:

(tsdb_1_1@macbookv-3)2876> <<0,0,1>> < <<1,0,0>>.
true

As well as:

(tsdb_1_1@macbookv-3)2878> [0,0,1] < [1,0,0].
true

The result (true) makes sense — in both cases operands are of the same size, and the first (leftmost) element value of the left operand is less than  first (again, leftmost) element value of the right operand.

However:

(tsdb_1_1@macbookv-3)2877> <<0,0,1>> < <<1,0>>.   
true

and

(tsdb_1_1@macbookv-3)2879> [0,0,1] < [1,0].  
true

This indicates that the actual length of the operands are not considered, for clearly, three octets cannot be less then two, right?

This becomes even more confusing if you check how tuples are compared given the similar test-case:

(tsdb_1_1@macbookv-3)2886> {0,0,1} < {1,0,0}.  
true
(tsdb_1_1@macbookv-3)2887> {0,0,1} < {1,0}.  
false

Here, the number of elements in the tuple are clearly taken into the consideration.

Then, if one converts all three pairs of operands to its (external) binary representation, e.g. 

Binary:

(tsdb_1_1@macbookv-3)2916> term_to_binary( <<0,0,1>> ).
<<131,109,0,0,0,3,0,0,1>>
(tsdb_1_1@macbookv-3)2917> term_to_binary( <<1,0>> ).  
<<131,109,0,0,0,2,1,0>>

List:

tsdb_1_1@macbookv-3)2918> term_to_binary( [0,0,1] ).  
<<131,107,0,3,0,0,1>>
(tsdb_1_1@macbookv-3)2919> term_to_binary( [1,0] ).  
<<131,107,0,2,1,0>>

Tuple:
(tsdb_1_1@macbookv-3)2920> term_to_binary( {0,0,1} ).
<<131,104,3,97,0,97,0,97,1>>
(tsdb_1_1@macbookv-3)2921> term_to_binary( {1,0} ).  
<<131,104,2,97,1,97,0>>

One could see that the number of “elements” are known and available for comparison (I’ve highlighted this number using bold, yellow font) for all three situations.

And yet, the different approaches between binary and lists on one side, and tuples, on another, appear to be deliberate as binary and list types are following the approach used by C standard library memcmp function, whereas tuples do not, e.g.

int memcmp(const void *s1, const void *s2, size_t n);
In other words, operands are reduced to some common size ’n', and then compared as if they were equal in length. 
I do understand this restriction in C -- if not for ’n’, the argument sizes are not known, and without it the execution may end up in a catastrophic failure (e.g. segment violation, hmm.. which can still happen anyway if ’n’ is inadequate).

So, THE QUESTION:

Why would it be wrong to consider inequalities for binary() data types the same way we do it for tuples — number of “elements” first and  “numeric values” of individual elements — second?

In my view, this is not only not wrong, but it would be more logical, and this is why.

Look at the following comparison:

(tsdb_1_1@macbookv-3)2957> <<0,0,1>> < <<0,0,1,0>>.
true


Well, this make sense, for less-is-less (e.g. three octets are less than four octets) and even if one considers  <<0,0,1>> as a 24-bit integer value, and <<0,0,1,0>> as a 32-bit integer, one may “normalise” the 24-bit integer to its 32-bit equivalent by adding a leading zero, and the comparison expression would still return true:

(tsdb_1_1@macbookv-3)2961> <<0,0,0,1>> < <<0,0,1,0>>.   
true

Therefore, as expected, 1 <  256, and, thus, we may be forgiven if we have the same expectation when we write:

(tsdb_1_1@macbookv-3)2958> <<1>> < <<0,0,1,0>>. 
 
Right? Well, not so, because:

(tsdb_1_1@macbookv-3)2958> <<1>> < <<0,0,1,0>>.    
false

Thus, "less is more", and therefore 1 appears to be greater-than-or-equal-to  256. Somehow.
 
This could never happen if one considered a number of elements (in this case number of octets) before attempting to compare the individual element values.

But wait, the "less-is-more" may easily become its opposite — “more is less”, because:

(tsdb_1_1@macbookv-3)2973> <<32, 97>> < <<97>>.
true

Yes, this approach may help us to sort a list of, say,  surnames alphabetically, so we can have all the “Ms” before the “Ns”, regardless of how long the surname is, but is this actually logical for two arbitrary binaries? Seeing that we actually care about number of elements when we deal with tuples, why not extend the same approach to arbitrary binary values?

Why do we presume that binaries are given as STRINGS when we compare them? When it comes to lists, the situation is even worse, because lists do, but then do not follow the STRING comparison semantic:

(tsdb_1_1@macbookv-3)2991> [32, 97] < [97].  
true

*** But! ***

(tsdb_1_1@macbookv-3)2992> [{32, 97}] < [97].
false

(tsdb_1_1@macbookv-3)2993> [{32, 97}] < [{97}].
false

And just to be clear, I do not expect Erlang to change. Nor do I care about how lists are handled (well, at least not for the moment).

My main concern is semantic behind comparing two arbitrary binary() operands. For one of my projects I’d like to employ tuple-like semantic (.e.g. number of octets first, values later) when comparing two arbitrary binaries and thus avoid “less-is-more” and “more-is-less” paradoxes.

Is there any reason why I shouldn’t? (Okay, it may be a bit slower, but one may argue this approach may be even a bit faster for a long-enough binaries. But what else?)

At any rate, I would appreciate your reasoning on the topic. What resonates better with you?


Thanks in advance.

V/



--
Kind regards,
Dmitry Belyaev
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Fred Hebert-2


On Fri, Oct 25, 2019 at 11:35 AM Dmitry Belyaev <[hidden email]> wrote:
To me alphabetical ordering of lists (as legacy string representation) and binaries (as modern string representation) is natural, so it is expected that "aa" < "b" < "bb".

As a general note, this is only true to the extent that you don't care that "ZOOM" is < than "apple" in this dictionary. It is anything but natural, I would say it's probably more "accidental". Actual string sorting tends to rely on a Unicode collation algorithm (a part of the spec that is still missing in the stdlib), since even string normalization won't be enough to ensure things are done safely with regards to human expectations. At this point only libraries such as ux support it well.

Most of the orderings we have right now are often arbitrary and can have surprising sides to them. At this point for most data structures the important thing is not that the ordering is correct, but that at least one exists at all; this allows the creation of a bunch of data structures (usually trees and variants) to work on all data types, mixed or not. All atoms are smaller than all strings, even if atoms have a string component to them, for example.
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Valentin Micic-6
Thanks for the replies… unsatisfactory as they me have been :-)

Kind regards

V/

On 25 Oct 2019, at 17:43, Fred Hebert <[hidden email]> wrote:



On Fri, Oct 25, 2019 at 11:35 AM Dmitry Belyaev <[hidden email]> wrote:
To me alphabetical ordering of lists (as legacy string representation) and binaries (as modern string representation) is natural, so it is expected that "aa" < "b" < "bb".

As a general note, this is only true to the extent that you don't care that "ZOOM" is < than "apple" in this dictionary. It is anything but natural, I would say it's probably more "accidental". Actual string sorting tends to rely on a Unicode collation algorithm (a part of the spec that is still missing in the stdlib), since even string normalization won't be enough to ensure things are done safely with regards to human expectations. At this point only libraries such as ux support it well.

Most of the orderings we have right now are often arbitrary and can have surprising sides to them. At this point for most data structures the important thing is not that the ordering is correct, but that at least one exists at all; this allows the creation of a bunch of data structures (usually trees and variants) to work on all data types, mixed or not. All atoms are smaller than all strings, even if atoms have a string component to them, for example.

Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Raimo Niskanen-13
In reply to this post by Valentin Micic-6
I'd like to defend the current term order of lists, tuples and
binaries.

Lists are compared a'la string comparison, which avoids having to
traverse the whole list when comparing.  Binaries are compared as lists
since both are used for handling strings, and also to be compared like
memcmp does it.

Tuples are used for general complex terms.  They can be compared size
first whithout having to be fully traversed, as opposed to lists, which
is useful in other contexts.  You can thereby choose your data
structure to select comparision.
--
Raimo Niskanen

Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

zxq9-2
On 2019年10月28日 月曜日 08:02:13 Raimo Niskanen wrote:
> Lists are compared a'la string comparison, which avoids having to
> traverse the whole list when comparing.  Binaries are compared as lists
> since both are used for handling strings, and also to be compared like
> memcmp does it.
>
> Tuples are used for general complex terms.  They can be compared size
> first whithout having to be fully traversed, as opposed to lists, which
> is useful in other contexts.  You can thereby choose your data
> structure to select comparision.


Thanks for the explanation.
Worth bookmarking for future reference.

-Craig
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Valentin Micic-6
In reply to this post by Raimo Niskanen-13

On 28 Oct 2019, at 10:02, Raimo Niskanen <[hidden email]> wrote:

I'd like to defend the current term order of lists, tuples and
binaries.

Lists are compared a'la string comparison, which avoids having to
traverse the whole list when comparing.  Binaries are compared as lists
since both are used for handling strings, and also to be compared like
memcmp does it.

Tuples are used for general complex terms.  They can be compared size
first whithout having to be fully traversed, as opposed to lists, which
is useful in other contexts.  You can thereby choose your data
structure to select comparision.
--
Raimo Niskanen


Thanks Raimo,

Your argument seems to have one…well, misconception. I don’t think it would be correct to imply that Binaries are used exclusively for handling strings, but rather, one of their uses may be that.

For example, here's a syntax that is not anything “string-like”:


(tsdb_1_1@macbookv-3)3593> <<1:16>> < <<2:16>>. 
true
(tsdb_1_1@macbookv-3)3594> <<50:16>> < <<36:16>>.
false


As you could see, this pretty much behaves the way we would expect when we compare two integer values, e.g.

1 is less than 2, and so is <<1:16>> < <<2:16>> Technically, we are not talking about strings here, but, rather two 16-bit integers.

I can also say that:

(tsdb_1_1@macbookv-3)3597> 1 < 000002.          
true

When we write: 

(tsdb_1_1@macbookv-3)3593> <<1:16>>. 

It would be wrong to pretend that we’re actually talking about a strings (e.g. in alphanumeric sense). This clearly means that the integer value of 1 stored using Big-Endian encoding (e.g. network byte order).

Thus, when we write:  <<2:16>> we get <<0,2>>. When we write: <<2:24>> we get <<0,0,2>>... these values are *not* intended to be strings, but integers.
So, when we add leading zeroes, we do not change the integer value.

So why is then:

(tsdb_1_1@macbookv-3)3600> <<1:16>> < <<2:24>>.
false

First, we’re clearly use integer syntax to encode integer values, then we have the first integer value encoded using 16-bits, and the second integer value encoded using 24-bits.
It just happens so, that 16-bits is used to encode the value of 1, and 24-bits to encode the value of two.

Thus, since 16-bits are less then 24-bits (in length), but also 1 is less than 2, one may expect this to yield TRUE. Yet somehow,  two octets are NOT LESS than there, nor 1 is NOT LESS than 2!

I think this cannot pass the "red-face test”, and thus does not deserve defending.

Contrast this with the way tuples are handled:

(tsdb_1_1@macbookv-3)3666> {1,1} < {1,2}.
true
(tsdb_1_1@macbookv-3)3667> {1,3} < {1,2}.
false
(tsdb_1_1@macbookv-3)3668> {1,3} < {1,2,3}.
true

Considering that Binaries may be used to encode ANYTHING, shouldn’t they be handled the same way as tuples instead of:


(tsdb_1_1@macbookv-3)3624> <<1,1>> < <<1,2>>.
true
(tsdb_1_1@macbookv-3)3625> <<1,3>> < <<1,2>>.
false
(tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
false


As I said in my previous email, I do not expect Erlang to change, and for my "own intents and purposes” I am still considering if:

(tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
false

should be given more credence than, say TRUE... if nothing else, because two octets are less than three octets.

In other words, if a three-element-tuple, regardless of its content,  could never be less than any given two-elements-tuple, why shouldn't the same hold for Binaries?

Kind regards

V/


Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Dániel Szoboszlay
Hi Valentin,

I think you should acknowledge that the ordering of terms as defined by the Erlang language does not aim to fit any semantic interpretation of the data you may have. The sole purpose of the ordering is to be well defined, so you can create e.g. a binary search tree that holds whatever data. For that purpose any crazy ordering like 1 < false < 0 < [1, 2, foobar] < <0.12.34> < 42 < [] < 30 < baz ... would work fine, as long as the place of every term in this sequence is well defined. There's only one constraint (apart from not violating the mathematical properties of a strict ordering relation), and that is to make comparison of terms reasonably efficient (my above example would fail this check).

So you feel <<1:16>> should be less than <<2:24>>? That's because you interpret <<0,1>> and <<0,0,2>> as variable length encoding of integers. If you have this use case, than write a custom comparison function and sort your binaries with that. I may have a use case where the same two binaries mean operation #1 (with no arguments) and operation #0 (with a single argument 2), so I would have to write a different comparison function (but as lucky as I am, I may just use erlang:'<'/2 this time, phew!).

But if you would like to know why I find the built-in ordering of terms logical, it is because both lists and binaries are usually used to represent variable sized data, while tuples are for fixed size data. I mean that {1,2} and {1,2,3} typically describe semantically different things, e.g. a 2D coordinate and a semver version number. Their relative ordering is as irrelevant for most applications as the ordering of {1,2} and <0.0.1>. Ordering by tuple size first makes some sense because it will keep types together: just like all integers are less than atoms, all 2D coordinates are less than version numbers.

Binaries may represent fixed size data (e.g. a packet header), but than it doesn't matter whether you order lexicographically or size first, because all binaries in your domain will be the same size. Most often however binaries represent variable sized data, such as payload in a packet. And for variable sized data the human convention is lexicographical ordering. A printed dictionary may sort words by length first, and it would be a (mostly) usable dictionary, but that's just not how things are by convention.

Cheers,
Dániel

P.S.: By the way, this is why I don't like passing fixed length lists as arguments for init/1 in gen_* callbacks (such as init([Some, Args, Here]) -> ...). These should be tuples instead, they are not variable length!

On Wed, 30 Oct 2019 at 14:35, Valentin Micic <[hidden email]> wrote:

On 28 Oct 2019, at 10:02, Raimo Niskanen <[hidden email]> wrote:

I'd like to defend the current term order of lists, tuples and
binaries.

Lists are compared a'la string comparison, which avoids having to
traverse the whole list when comparing.  Binaries are compared as lists
since both are used for handling strings, and also to be compared like
memcmp does it.

Tuples are used for general complex terms.  They can be compared size
first whithout having to be fully traversed, as opposed to lists, which
is useful in other contexts.  You can thereby choose your data
structure to select comparision.
--
Raimo Niskanen


Thanks Raimo,

Your argument seems to have one…well, misconception. I don’t think it would be correct to imply that Binaries are used exclusively for handling strings, but rather, one of their uses may be that.

For example, here's a syntax that is not anything “string-like”:


(tsdb_1_1@macbookv-3)3593> <<1:16>> < <<2:16>>. 
true
(tsdb_1_1@macbookv-3)3594> <<50:16>> < <<36:16>>.
false


As you could see, this pretty much behaves the way we would expect when we compare two integer values, e.g.

1 is less than 2, and so is <<1:16>> < <<2:16>> Technically, we are not talking about strings here, but, rather two 16-bit integers.

I can also say that:

(tsdb_1_1@macbookv-3)3597> 1 < 000002.          
true

When we write: 

(tsdb_1_1@macbookv-3)3593> <<1:16>>. 

It would be wrong to pretend that we’re actually talking about a strings (e.g. in alphanumeric sense). This clearly means that the integer value of 1 stored using Big-Endian encoding (e.g. network byte order).

Thus, when we write:  <<2:16>> we get <<0,2>>. When we write: <<2:24>> we get <<0,0,2>>... these values are *not* intended to be strings, but integers.
So, when we add leading zeroes, we do not change the integer value.

So why is then:

(tsdb_1_1@macbookv-3)3600> <<1:16>> < <<2:24>>.
false

First, we’re clearly use integer syntax to encode integer values, then we have the first integer value encoded using 16-bits, and the second integer value encoded using 24-bits.
It just happens so, that 16-bits is used to encode the value of 1, and 24-bits to encode the value of two.

Thus, since 16-bits are less then 24-bits (in length), but also 1 is less than 2, one may expect this to yield TRUE. Yet somehow,  two octets are NOT LESS than there, nor 1 is NOT LESS than 2!

I think this cannot pass the "red-face test”, and thus does not deserve defending.

Contrast this with the way tuples are handled:

(tsdb_1_1@macbookv-3)3666> {1,1} < {1,2}.
true
(tsdb_1_1@macbookv-3)3667> {1,3} < {1,2}.
false
(tsdb_1_1@macbookv-3)3668> {1,3} < {1,2,3}.
true

Considering that Binaries may be used to encode ANYTHING, shouldn’t they be handled the same way as tuples instead of:


(tsdb_1_1@macbookv-3)3624> <<1,1>> < <<1,2>>.
true
(tsdb_1_1@macbookv-3)3625> <<1,3>> < <<1,2>>.
false
(tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
false


As I said in my previous email, I do not expect Erlang to change, and for my "own intents and purposes” I am still considering if:

(tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
false

should be given more credence than, say TRUE... if nothing else, because two octets are less than three octets.

In other words, if a three-element-tuple, regardless of its content,  could never be less than any given two-elements-tuple, why shouldn't the same hold for Binaries?

Kind regards

V/


Reply | Threaded
Open this post in threaded view
|

Off-topic from Binary, List and Tuple Inequalities (Paradox?)

Brujo Benavides-3
P.S.: By the way, this is why I don't like passing fixed length lists as arguments for init/1 in gen_* callbacks (such as init([Some, Args, Here]) -> ...). These should be tuples instead, they are not variable length!

YES!!! Yes yes!! N+1 times yes! I've said this many times: the only argument that init/1 has should be called Arg, not Args... And if you don't have anything to pass in to that function as an argument on start_link, just use 'undefined' or any other sensibly named atom... Even {} is better than [].
Using lists is not only semantically incorrect, it's also often confused with mfa (i.e. module, function & arguments, like in spawn/3) and people then define init with multiple arguments 🤦‍♂️

On Wed, 30 Oct 2019 at 20:13 Dániel Szoboszlay <[hidden email]> wrote:
Hi Valentin,

I think you should acknowledge that the ordering of terms as defined by the Erlang language does not aim to fit any semantic interpretation of the data you may have. The sole purpose of the ordering is to be well defined, so you can create e.g. a binary search tree that holds whatever data. For that purpose any crazy ordering like 1 < false < 0 < [1, 2, foobar] < <0.12.34> < 42 < [] < 30 < baz ... would work fine, as long as the place of every term in this sequence is well defined. There's only one constraint (apart from not violating the mathematical properties of a strict ordering relation), and that is to make comparison of terms reasonably efficient (my above example would fail this check).

So you feel <<1:16>> should be less than <<2:24>>? That's because you interpret <<0,1>> and <<0,0,2>> as variable length encoding of integers. If you have this use case, than write a custom comparison function and sort your binaries with that. I may have a use case where the same two binaries mean operation #1 (with no arguments) and operation #0 (with a single argument 2), so I would have to write a different comparison function (but as lucky as I am, I may just use erlang:'<'/2 this time, phew!).

But if you would like to know why I find the built-in ordering of terms logical, it is because both lists and binaries are usually used to represent variable sized data, while tuples are for fixed size data. I mean that {1,2} and {1,2,3} typically describe semantically different things, e.g. a 2D coordinate and a semver version number. Their relative ordering is as irrelevant for most applications as the ordering of {1,2} and <0.0.1>. Ordering by tuple size first makes some sense because it will keep types together: just like all integers are less than atoms, all 2D coordinates are less than version numbers.

Binaries may represent fixed size data (e.g. a packet header), but than it doesn't matter whether you order lexicographically or size first, because all binaries in your domain will be the same size. Most often however binaries represent variable sized data, such as payload in a packet. And for variable sized data the human convention is lexicographical ordering. A printed dictionary may sort words by length first, and it would be a (mostly) usable dictionary, but that's just not how things are by convention.

Cheers,
Dániel

P.S.: By the way, this is why I don't like passing fixed length lists as arguments for init/1 in gen_* callbacks (such as init([Some, Args, Here]) -> ...). These should be tuples instead, they are not variable length!

On Wed, 30 Oct 2019 at 14:35, Valentin Micic <[hidden email]> wrote:

On 28 Oct 2019, at 10:02, Raimo Niskanen <[hidden email]> wrote:

I'd like to defend the current term order of lists, tuples and
binaries.

Lists are compared a'la string comparison, which avoids having to
traverse the whole list when comparing.  Binaries are compared as lists
since both are used for handling strings, and also to be compared like
memcmp does it.

Tuples are used for general complex terms.  They can be compared size
first whithout having to be fully traversed, as opposed to lists, which
is useful in other contexts.  You can thereby choose your data
structure to select comparision.
--
Raimo Niskanen


Thanks Raimo,

Your argument seems to have one…well, misconception. I don’t think it would be correct to imply that Binaries are used exclusively for handling strings, but rather, one of their uses may be that.

For example, here's a syntax that is not anything “string-like”:


(tsdb_1_1@macbookv-3)3593> <<1:16>> < <<2:16>>. 
true
(tsdb_1_1@macbookv-3)3594> <<50:16>> < <<36:16>>.
false


As you could see, this pretty much behaves the way we would expect when we compare two integer values, e.g.

1 is less than 2, and so is <<1:16>> < <<2:16>> Technically, we are not talking about strings here, but, rather two 16-bit integers.

I can also say that:

(tsdb_1_1@macbookv-3)3597> 1 < 000002.          
true

When we write: 

(tsdb_1_1@macbookv-3)3593> <<1:16>>. 

It would be wrong to pretend that we’re actually talking about a strings (e.g. in alphanumeric sense). This clearly means that the integer value of 1 stored using Big-Endian encoding (e.g. network byte order).

Thus, when we write:  <<2:16>> we get <<0,2>>. When we write: <<2:24>> we get <<0,0,2>>... these values are *not* intended to be strings, but integers.
So, when we add leading zeroes, we do not change the integer value.

So why is then:

(tsdb_1_1@macbookv-3)3600> <<1:16>> < <<2:24>>.
false

First, we’re clearly use integer syntax to encode integer values, then we have the first integer value encoded using 16-bits, and the second integer value encoded using 24-bits.
It just happens so, that 16-bits is used to encode the value of 1, and 24-bits to encode the value of two.

Thus, since 16-bits are less then 24-bits (in length), but also 1 is less than 2, one may expect this to yield TRUE. Yet somehow,  two octets are NOT LESS than there, nor 1 is NOT LESS than 2!

I think this cannot pass the "red-face test”, and thus does not deserve defending.

Contrast this with the way tuples are handled:

(tsdb_1_1@macbookv-3)3666> {1,1} < {1,2}.
true
(tsdb_1_1@macbookv-3)3667> {1,3} < {1,2}.
false
(tsdb_1_1@macbookv-3)3668> {1,3} < {1,2,3}.
true

Considering that Binaries may be used to encode ANYTHING, shouldn’t they be handled the same way as tuples instead of:


(tsdb_1_1@macbookv-3)3624> <<1,1>> < <<1,2>>.
true
(tsdb_1_1@macbookv-3)3625> <<1,3>> < <<1,2>>.
false
(tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
false


As I said in my previous email, I do not expect Erlang to change, and for my "own intents and purposes” I am still considering if:

(tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
false

should be given more credence than, say TRUE... if nothing else, because two octets are less than three octets.

In other words, if a three-element-tuple, regardless of its content,  could never be less than any given two-elements-tuple, why shouldn't the same hold for Binaries?

Kind regards

V/


--
Brujo Benavides
about.me/elbrujohalcon
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Raimo Niskanen-13
In reply to this post by Valentin Micic-6
On Wed, 2019-10-30 at 15:34 +0200, Valentin Micic wrote:

> > On 28 Oct 2019, at 10:02, Raimo Niskanen <
> > [hidden email]> wrote:
> >
> > I'd like to defend the current term order of lists, tuples and
> > binaries.
> >
> > Lists are compared a'la string comparison, which avoids having to
> > traverse the whole list when comparing.  Binaries are compared as
> > lists
> > since both are used for handling strings, and also to be compared
> > like
> > memcmp does it.
> >
> > Tuples are used for general complex terms.  They can be compared
> > size
> > first whithout having to be fully traversed, as opposed to lists,
> > which
> > is useful in other contexts.  You can thereby choose your data
> > structure to select comparision.
> > --
> > Raimo Niskanen
> >
>
> Thanks Raimo,
>
> Your argument seems to have one…well, misconception. I don’t think it
> would be correct to imply that Binaries are used exclusively for
> handling strings, but rather, one of their uses may be that.

I did not mean to imply that binaries are used exclusevily for handling
strings.  I merely said that both lists and binaries are used for
handling strings, i.e that strings is one use case.

Although I confess that I think it is an important use case, if not
even a very important use case.

When binaries were introduced it was already recognized that having
strings as lists is a clumsy representation, especially with 64-bit
then on the way with almost double the space waste.  So one could
foresee that binaries would be used for storing strings.

And to not have strings as binaries sort like strings as lists would be
a really bad idea for that use case.

As for now, Elixir uses binaries as their default representation for
strings, so having them sort as strings is vital for Elixir.
(Unicode has shot a big hole in that argument, but still true for
latin-1)

>
> For example, here's a syntax that is not anything “string-like”:
>
>
> (tsdb_1_1@macbookv-3)3593> <<1:16>> < <<2:16>>.
> true
> (tsdb_1_1@macbookv-3)3594> <<50:16>> < <<36:16>>.
> false
>
:
:

> As I said in my previous email, I do not expect Erlang to change, and
> for my "own intents and purposes” I am still considering if:
>
> (tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
> false
>
> should be given more credence than, say TRUE... if nothing else,
> because two octets are less than three octets.
>
> In other words, if a three-element-tuple, regardless of its content,
>  could never be less than any given two-elements-tuple, why shouldn't
> the same hold for Binaries?

If you want to compare terms as integers, then represent them as
integers!  Erlang bignums can be big enough for most purposes.

Erlang does not have operator overloading or subclassing, and has a
global term order.  So exactly one sort order has to be chosen for a
type, and then a guess has to be made of which one is "best" or "most
useful".

The one chosen for binaries was string sort order, because that is how
C compares binaries, and strings seemed to be a very important use case
for binaries, which Elixir kind of has proven after the fact.

The tuple sort order could have been chosen for binaries, but as I said
above that would be bad for the string use case.

Also, regarding "useful"; tuple sort order for binaries can easily be
implemented by first comparing sizes then comparing content, whilst
should you have tulpe sort order and want to implement list sort order
you would have to loop byte by byte.  Therefore the list sort order can
be regarded as more useful since it can be a building block for
efficient tuple sort order...

Best regards
/ Raimo Niskanen

>
> Kind regards
>
> V/
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Richard O'Keefe
You can represent all sorts of information as binaries,
but the binary doesn't know.  It may be that you mean
<<65,66,67>> to represent "ABC"; it may be that you
mean to represent 4276803; it may be that you mean
to represent 'play A below middle c for 66 milliseconds
on instrument 67'.  Your intentions are not part of the
binary.  So there is no way to have a sort order for
binaries (or anything else) that perfectly suits all use
cases.  Even for strings, English dictionaries and
telephone books use *different* orders and neither of
them matches strcmp().

For Prolog, I managed in 1984 to come up with a
justification for the ordering of terms of different kinds
(coherence between @< and subtypes) which the ISO
committee chose to ignore.

The best we can hope for in a "global" term ordering is
that it satisfy the axioms of a total order.  That's a useful
thing to have.

Want something else?  Program it.

On Thu, 31 Oct 2019 at 23:53, Raimo Niskanen
<[hidden email]> wrote:

>
> On Wed, 2019-10-30 at 15:34 +0200, Valentin Micic wrote:
> > > On 28 Oct 2019, at 10:02, Raimo Niskanen <
> > > [hidden email]> wrote:
> > >
> > > I'd like to defend the current term order of lists, tuples and
> > > binaries.
> > >
> > > Lists are compared a'la string comparison, which avoids having to
> > > traverse the whole list when comparing.  Binaries are compared as
> > > lists
> > > since both are used for handling strings, and also to be compared
> > > like
> > > memcmp does it.
> > >
> > > Tuples are used for general complex terms.  They can be compared
> > > size
> > > first whithout having to be fully traversed, as opposed to lists,
> > > which
> > > is useful in other contexts.  You can thereby choose your data
> > > structure to select comparision.
> > > --
> > > Raimo Niskanen
> > >
> >
> > Thanks Raimo,
> >
> > Your argument seems to have one…well, misconception. I don’t think it
> > would be correct to imply that Binaries are used exclusively for
> > handling strings, but rather, one of their uses may be that.
>
> I did not mean to imply that binaries are used exclusevily for handling
> strings.  I merely said that both lists and binaries are used for
> handling strings, i.e that strings is one use case.
>
> Although I confess that I think it is an important use case, if not
> even a very important use case.
>
> When binaries were introduced it was already recognized that having
> strings as lists is a clumsy representation, especially with 64-bit
> then on the way with almost double the space waste.  So one could
> foresee that binaries would be used for storing strings.
>
> And to not have strings as binaries sort like strings as lists would be
> a really bad idea for that use case.
>
> As for now, Elixir uses binaries as their default representation for
> strings, so having them sort as strings is vital for Elixir.
> (Unicode has shot a big hole in that argument, but still true for
> latin-1)
>
> >
> > For example, here's a syntax that is not anything “string-like”:
> >
> >
> > (tsdb_1_1@macbookv-3)3593> <<1:16>> < <<2:16>>.
> > true
> > (tsdb_1_1@macbookv-3)3594> <<50:16>> < <<36:16>>.
> > false
> >
> :
> :
> > As I said in my previous email, I do not expect Erlang to change, and
> > for my "own intents and purposes” I am still considering if:
> >
> > (tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
> > false
> >
> > should be given more credence than, say TRUE... if nothing else,
> > because two octets are less than three octets.
> >
> > In other words, if a three-element-tuple, regardless of its content,
> >  could never be less than any given two-elements-tuple, why shouldn't
> > the same hold for Binaries?
>
> If you want to compare terms as integers, then represent them as
> integers!  Erlang bignums can be big enough for most purposes.
>
> Erlang does not have operator overloading or subclassing, and has a
> global term order.  So exactly one sort order has to be chosen for a
> type, and then a guess has to be made of which one is "best" or "most
> useful".
>
> The one chosen for binaries was string sort order, because that is how
> C compares binaries, and strings seemed to be a very important use case
> for binaries, which Elixir kind of has proven after the fact.
>
> The tuple sort order could have been chosen for binaries, but as I said
> above that would be bad for the string use case.
>
> Also, regarding "useful"; tuple sort order for binaries can easily be
> implemented by first comparing sizes then comparing content, whilst
> should you have tulpe sort order and want to implement list sort order
> you would have to loop byte by byte.  Therefore the list sort order can
> be regarded as more useful since it can be a building block for
> efficient tuple sort order...
>
> Best regards
> / Raimo Niskanen
>
> >
> > Kind regards
> >
> > V/
> >
> >
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Valentin Micic-6


> On 31 Oct 2019, at 21:43, Richard O'Keefe <[hidden email]> wrote:
>
> You can represent all sorts of information as binaries,
> but the binary doesn't know.  It may be that you mean
> <<65,66,67>> to represent "ABC"; it may be that you
> mean to represent 4276803; it may be that you mean
> to represent 'play A below middle c for 66 milliseconds
> on instrument 67'.  Your intentions are not part of the
> binary.  So there is no way to have a sort order for
> binaries (or anything else) that perfectly suits all use
> cases.  Even for strings, English dictionaries and
> telephone books use *different* orders and neither of
> them matches strcmp().
>
> For Prolog, I managed in 1984 to come up with a
> justification for the ordering of terms of different kinds
> (coherence between @< and subtypes) which the ISO
> committee chose to ignore.
>
> The best we can hope for in a "global" term ordering is
> that it satisfy the axioms of a total order.  That's a useful
> thing to have.
>
> Want something else?  Program it.

Yes, indeed, and I’ve implied that much — I do want to program it, but I also wanted to know what people think  — if one aspires to program for others, it seems reasonable to consider dominant logic on the issue.

I started by thinking that less cannot be more, e.g. two octets cannot be less than three.

Now I am increasingly convinced that one should never compare Binaries that are not of the same size.
However, I am not so sure what to do if, or rather when, one has to (compare binaries of unequal sizes).

So here’s the question:

Presuming that one does not know anything about semantic given by data stored in a Binary, but one knows about the size of the binaries being compared, wouldn’t it make sense then to consider that two octets should be less than three?

Wouldn’t this approach fit the most general case (independent of what Elixir people do).

V/


>
> On Thu, 31 Oct 2019 at 23:53, Raimo Niskanen
> <[hidden email]> wrote:
>>
>> On Wed, 2019-10-30 at 15:34 +0200, Valentin Micic wrote:
>>>> On 28 Oct 2019, at 10:02, Raimo Niskanen <
>>>> [hidden email]> wrote:
>>>>
>>>> I'd like to defend the current term order of lists, tuples and
>>>> binaries.
>>>>
>>>> Lists are compared a'la string comparison, which avoids having to
>>>> traverse the whole list when comparing.  Binaries are compared as
>>>> lists
>>>> since both are used for handling strings, and also to be compared
>>>> like
>>>> memcmp does it.
>>>>
>>>> Tuples are used for general complex terms.  They can be compared
>>>> size
>>>> first whithout having to be fully traversed, as opposed to lists,
>>>> which
>>>> is useful in other contexts.  You can thereby choose your data
>>>> structure to select comparision.
>>>> --
>>>> Raimo Niskanen
>>>>
>>>
>>> Thanks Raimo,
>>>
>>> Your argument seems to have one…well, misconception. I don’t think it
>>> would be correct to imply that Binaries are used exclusively for
>>> handling strings, but rather, one of their uses may be that.
>>
>> I did not mean to imply that binaries are used exclusevily for handling
>> strings.  I merely said that both lists and binaries are used for
>> handling strings, i.e that strings is one use case.
>>
>> Although I confess that I think it is an important use case, if not
>> even a very important use case.
>>
>> When binaries were introduced it was already recognized that having
>> strings as lists is a clumsy representation, especially with 64-bit
>> then on the way with almost double the space waste.  So one could
>> foresee that binaries would be used for storing strings.
>>
>> And to not have strings as binaries sort like strings as lists would be
>> a really bad idea for that use case.
>>
>> As for now, Elixir uses binaries as their default representation for
>> strings, so having them sort as strings is vital for Elixir.
>> (Unicode has shot a big hole in that argument, but still true for
>> latin-1)
>>
>>>
>>> For example, here's a syntax that is not anything “string-like”:
>>>
>>>
>>> (tsdb_1_1@macbookv-3)3593> <<1:16>> < <<2:16>>.
>>> true
>>> (tsdb_1_1@macbookv-3)3594> <<50:16>> < <<36:16>>.
>>> false
>>>
>> :
>> :
>>> As I said in my previous email, I do not expect Erlang to change, and
>>> for my "own intents and purposes” I am still considering if:
>>>
>>> (tsdb_1_1@macbookv-3)3626> <<1,3>> < <<1,2,3>>.
>>> false
>>>
>>> should be given more credence than, say TRUE... if nothing else,
>>> because two octets are less than three octets.
>>>
>>> In other words, if a three-element-tuple, regardless of its content,
>>> could never be less than any given two-elements-tuple, why shouldn't
>>> the same hold for Binaries?
>>
>> If you want to compare terms as integers, then represent them as
>> integers!  Erlang bignums can be big enough for most purposes.
>>
>> Erlang does not have operator overloading or subclassing, and has a
>> global term order.  So exactly one sort order has to be chosen for a
>> type, and then a guess has to be made of which one is "best" or "most
>> useful".
>>
>> The one chosen for binaries was string sort order, because that is how
>> C compares binaries, and strings seemed to be a very important use case
>> for binaries, which Elixir kind of has proven after the fact.
>>
>> The tuple sort order could have been chosen for binaries, but as I said
>> above that would be bad for the string use case.
>>
>> Also, regarding "useful"; tuple sort order for binaries can easily be
>> implemented by first comparing sizes then comparing content, whilst
>> should you have tulpe sort order and want to implement list sort order
>> you would have to loop byte by byte.  Therefore the list sort order can
>> be regarded as more useful since it can be a building block for
>> efficient tuple sort order...
>>
>> Best regards
>> / Raimo Niskanen
>>
>>>
>>> Kind regards
>>>
>>> V/
>>>
>>>

Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

zxq9-2
On 2019/11/01 6:03, Valentin Micic wrote:
>> On 31 Oct 2019, at 21:43, Richard O'Keefe <[hidden email]> wrote:
>>
>> You can represent all sorts of information as binaries,
>> but the binary doesn't know.
>> Your intentions are not part of the binary.

...

>> The best we can hope for in a "global" term ordering is
>> that it satisfy the axioms of a total order.  That's a useful
>> thing to have.
>>
>> Want something else?  Program it.

> Yes, indeed, and I’ve implied that much — I do want to program it, but I also wanted to know what people think  — if one aspires to program for others, it seems reasonable to consider dominant logic on the issue.
>
> I started by thinking that less cannot be more, e.g. two octets cannot be less than three.
>
> Now I am increasingly convinced that one should never compare Binaries that are not of the same size.
> However, I am not so sure what to do if, or rather when, one has to (compare binaries of unequal sizes).

Not so fast. You MUST be able to compare things in *some* way or else
you cannot establish a global order, and without that an entire universe
of useful things is closed to you.

Binaries are a primitive data type.
Strings can NEVER be a primitive data type.

With this in mind, you could useful things with custom types such as
"binary string", "list string", "mixed string"...

   -type bstring :: {bstring, binary()}.
   -type lstring :: {lstring, string()}.
   -type mstring :: {mstring, iolist()}.

Your types could also (whoa!) include meta that indicates the subtype of
the string data -- not merely whether it encoded as UTF8, but what the
collation rule is intended to be, whether the encoding is canonical and
if so which form, etc. (In my region this is all quite a big deal...)

   -type bstring :: {bstring, encoding(), collation(), lang(), binary()}.

Now you could REALLY do some useful sorting!

But not on binaries. They are primitives. The fact we have a universal
order across ALL primitives is great and something that goes
unappreciated until you run into a problem where it is critical to have
such an order and you lack it.

This discussion stems from a failure to distinguish between a complex
type (strings) and a language primitive.

In most modern OOP languages strings are objects (whether immutable or
not) and carry a fair amount of meta with them. When you rip the arms
and legs off an object you are essentially left with a tuple that
carries the core data itself along with useful descriptive meta. That's
what is needed to resolve the string sorting problem: the string itself
plus meta to describe its nature.

There is nothing even remotely weird about the way Erlang sorting order
works. That's a platform issue in Erlang. If we want arbitrary collation
sorts over a true (complex) string type, we're going to need to write a
library to get it.

-Craig
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Raimo Niskanen-13
In reply to this post by Raimo Niskanen-13
On Thu, 2019-10-31 at 20:20 +0400, Antonio Nikishaev wrote:

> > On 31 Oct 2019, at 14:52, Raimo Niskanen <
> > [hidden email]> wrote:
> >
> >
> > As for now, Elixir uses binaries as their default representation
> > for
> > strings, so having them sort as strings is vital for Elixir.
> > (Unicode has shot a big hole in that argument, but still true for
> > latin-1)
>
> It has not.  UTF-8 does preserve order.

Ah, yes!  That clever property of UTF-8 slipped my mind.

So in both Erlang and Elixir; codepoint lists sorts the same way as
their corresponding UTF-8 binaries (which is the default format for
binaris as strings in both Erlang and Elixir).

Not true for other Unicode binary encodings, though...

/ Raimo



>
>
>
>
> -- lelf
>
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Richard O'Keefe
I note that not only do dictionaries and phone books use different
collation orders,
so (fairly obviously) do different languages, and the rules can be
insanely complex.
English and French, for example, both require up to four passes over a
pair of strings
to resolve the order, and French runs one of those passes backwards.
(According to
the draft standard I read about 10 years ago.)  Things get really
complicated in a
bilingual country like Canada or New Zealand, where you have to
process text in a
mix of two languages.  (I'm not talking about scholarly articles, but
about everyday
newspaper articles.) It gets worse, but enough already.

The bottom line is that there is NO "right" way to compare binaries
and there is NO
"wrong" way either, as long as the axioms of a total order are
respected.  The way
Erlang does it is one good way.  "Compute the base-255 digits of pi to as many
places as necessary, exclusive-or each binary with the appropriate digits, and
compare the results using the existing algorithm" is another perfectly
legal way.

There are basically four requirements for Erlang-style term ordering:
(1) the axioms of a total order must be satisfied
(2) the implementation must be fast enough that programmers are happy to use
term comparison without wondering whether they should program something
(3) the rules must be spelled out so that programmers know what they are getting
(4) if the language uses the same operators for term comparison and arithmetic
comparison, the two must be compatible; if (like Prolog) term comparison (@<)
and arithmetic comparison (<) can disagree.  Just think of the fun you can have
trying to figure out what to do with NaNs.
There is another desideratum, which isn't quite as important:
(+) the "native" representation for text should be ordered in a way that is
unsurprising for programmers but MUST NOT be identical to locale collation
order for two reasons:
  (A) term comparison must distinguish things that locale collation
might identify
  (B) term comparison is for computers and is not meant to be locale-sensitive,
       while collation for users should be locale-sensitive.

It might be worth comparing three different languages I use:
Erlang: term comparison is a total order
Smalltalk:  x = y must return true or false consistently for ANY x and y
 but x < y need not work if x and y are of different kinds and for a
 given x (like nil) need not ever work for any y (even y = x).
Haskell: comparison is supposed to be total (thanks,IEEE) for any type
 that supports it; whether a type supports comparison is detected at
 compile-time; how the comparison is done can be different for every
 type.




On Fri, 1 Nov 2019 at 21:40, Raimo Niskanen <[hidden email]> wrote:

>
> On Thu, 2019-10-31 at 20:20 +0400, Antonio Nikishaev wrote:
> > > On 31 Oct 2019, at 14:52, Raimo Niskanen <
> > > [hidden email]> wrote:
> > >
> > >
> > > As for now, Elixir uses binaries as their default representation
> > > for
> > > strings, so having them sort as strings is vital for Elixir.
> > > (Unicode has shot a big hole in that argument, but still true for
> > > latin-1)
> >
> > It has not.  UTF-8 does preserve order.
>
> Ah, yes!  That clever property of UTF-8 slipped my mind.
>
> So in both Erlang and Elixir; codepoint lists sorts the same way as
> their corresponding UTF-8 binaries (which is the default format for
> binaris as strings in both Erlang and Elixir).
>
> Not true for other Unicode binary encodings, though...
>
> / Raimo
>
>
>
> >
> >
> >
> >
> > -- lelf
> >
Reply | Threaded
Open this post in threaded view
|

Re: Binary, List and Tuple Inequalities (Paradox?)

Richard O'Keefe
In reply to this post by Valentin Micic-6
Historical note: back in, oh, 1978 or 1979, I came across a PL/I
dialect which compared
strings X Y thus:
  if X is longer than Y,  X > Y
  if X is shorter than Y,  X < Y
  otherwise, compare the characters.
It was rather disconcerting to find that 'KETTLE' > 'CAT'.
However, it was also disconcerting to realise that the normal PL/I
rules are equally quirky:
  "If the strings are not the same length, pad the shorter on the
right with blanks,
   and then compare."
This means, for example, that you can find strings X Y such that X > X||Y
(where || is the string concatenation operator in PL/I).
Fortran does the same as PL/I here: first pad with blanks to the same
length, then compare.
And it has the same rather confusing result.
It turned out that the dialect I mentioned at the start had changed
the rules to get
something easier to reason about.  Of course, lexicographic order
would have been
easy to reason about too...

I built a string library once that stored strings in big-endian order
even on little-endian
machines, and word-aligned them, so that you could compare 4 (or 8)
bytes at a time.
It turned out not to be as much faster as I had hoped.

On Sat, 26 Oct 2019 at 03:37, Valentin Micic <[hidden email]> wrote:

>
> Hi all,
>
> Consider the following inequalities:
>
> (tsdb_1_1@macbookv-3)2876> <<0,0,1>> < <<1,0,0>>.
> true
>
> As well as:
>
> (tsdb_1_1@macbookv-3)2878> [0,0,1] < [1,0,0].
> true
>
> The result (true) makes sense — in both cases operands are of the same size, and the first (leftmost) element value of the left operand is less than  first (again, leftmost) element value of the right operand.
>
> However:
>
> (tsdb_1_1@macbookv-3)2877> <<0,0,1>> < <<1,0>>.
> true
>
> and
>
> (tsdb_1_1@macbookv-3)2879> [0,0,1] < [1,0].
> true
>
> This indicates that the actual length of the operands are not considered, for clearly, three octets cannot be less then two, right?
>
> This becomes even more confusing if you check how tuples are compared given the similar test-case:
>
> (tsdb_1_1@macbookv-3)2886> {0,0,1} < {1,0,0}.
> true
> (tsdb_1_1@macbookv-3)2887> {0,0,1} < {1,0}.
> false
>
> Here, the number of elements in the tuple are clearly taken into the consideration.
>
> Then, if one converts all three pairs of operands to its (external) binary representation, e.g.
>
> Binary:
>
> (tsdb_1_1@macbookv-3)2916> term_to_binary( <<0,0,1>> ).
> <<131,109,0,0,0,3,0,0,1>>
> (tsdb_1_1@macbookv-3)2917> term_to_binary( <<1,0>> ).
> <<131,109,0,0,0,2,1,0>>
>
> List:
>
> tsdb_1_1@macbookv-3)2918> term_to_binary( [0,0,1] ).
> <<131,107,0,3,0,0,1>>
> (tsdb_1_1@macbookv-3)2919> term_to_binary( [1,0] ).
> <<131,107,0,2,1,0>>
>
> Tuple:
> (tsdb_1_1@macbookv-3)2920> term_to_binary( {0,0,1} ).
> <<131,104,3,97,0,97,0,97,1>>
> (tsdb_1_1@macbookv-3)2921> term_to_binary( {1,0} ).
> <<131,104,2,97,1,97,0>>
>
> One could see that the number of “elements” are known and available for comparison (I’ve highlighted this number using bold, yellow font) for all three situations.
>
> And yet, the different approaches between binary and lists on one side, and tuples, on another, appear to be deliberate as binary and list types are following the approach used by C standard library memcmp function, whereas tuples do not, e.g.
>
> int memcmp(const void *s1, const void *s2, size_t n);
>
> In other words, operands are reduced to some common size ’n', and then compared as if they were equal in length.
>
> I do understand this restriction in C -- if not for ’n’, the argument sizes are not known, and without it the execution may end up in a catastrophic failure (e.g. segment violation, hmm.. which can still happen anyway if ’n’ is inadequate).
>
> So, THE QUESTION:
>
> Why would it be wrong to consider inequalities for binary() data types the same way we do it for tuples — number of “elements” first and  “numeric values” of individual elements — second?
>
> In my view, this is not only not wrong, but it would be more logical, and this is why.
>
> Look at the following comparison:
>
> (tsdb_1_1@macbookv-3)2957> <<0,0,1>> < <<0,0,1,0>>.
> true
>
>
> Well, this make sense, for less-is-less (e.g. three octets are less than four octets) and even if one considers  <<0,0,1>> as a 24-bit integer value, and <<0,0,1,0>> as a 32-bit integer, one may “normalise” the 24-bit integer to its 32-bit equivalent by adding a leading zero, and the comparison expression would still return true:
>
> (tsdb_1_1@macbookv-3)2961> <<0,0,0,1>> < <<0,0,1,0>>.
> true
>
> Therefore, as expected, 1 <  256, and, thus, we may be forgiven if we have the same expectation when we write:
>
> (tsdb_1_1@macbookv-3)2958> <<1>> < <<0,0,1,0>>.
>
> Right? Well, not so, because:
>
> (tsdb_1_1@macbookv-3)2958> <<1>> < <<0,0,1,0>>.
> false
>
> Thus, "less is more", and therefore 1 appears to be greater-than-or-equal-to  256. Somehow.
>
> This could never happen if one considered a number of elements (in this case number of octets) before attempting to compare the individual element values.
>
> But wait, the "less-is-more" may easily become its opposite — “more is less”, because:
>
> (tsdb_1_1@macbookv-3)2973> <<32, 97>> < <<97>>.
> true
>
> Yes, this approach may help us to sort a list of, say,  surnames alphabetically, so we can have all the “Ms” before the “Ns”, regardless of how long the surname is, but is this actually logical for two arbitrary binaries? Seeing that we actually care about number of elements when we deal with tuples, why not extend the same approach to arbitrary binary values?
>
> Why do we presume that binaries are given as STRINGS when we compare them? When it comes to lists, the situation is even worse, because lists do, but then do not follow the STRING comparison semantic:
>
> (tsdb_1_1@macbookv-3)2991> [32, 97] < [97].
> true
>
> *** But! ***
>
> (tsdb_1_1@macbookv-3)2992> [{32, 97}] < [97].
> false
>
> (tsdb_1_1@macbookv-3)2993> [{32, 97}] < [{97}].
> false
>
> And just to be clear, I do not expect Erlang to change. Nor do I care about how lists are handled (well, at least not for the moment).
>
> My main concern is semantic behind comparing two arbitrary binary() operands. For one of my projects I’d like to employ tuple-like semantic (.e.g. number of octets first, values later) when comparing two arbitrary binaries and thus avoid “less-is-more” and “more-is-less” paradoxes.
>
> Is there any reason why I shouldn’t? (Okay, it may be a bit slower, but one may argue this approach may be even a bit faster for a long-enough binaries. But what else?)
>
> At any rate, I would appreciate your reasoning on the topic. What resonates better with you?
>
>
> Thanks in advance.
>
> V/
>
>