Binary string literal syntax

Re: Binary string literal syntax

Sean Hinde-4
>>> Sure, most people have no clue how to program sockets these days so they use HTTP for everything -- but that isn't *most* protocols, that's a relatively small set of overwhelmingly *prolific* protocols. My prediction is that binary protocols will become more prolific as the extremely limited shared resource of wireless bandwidth becomes more and more saturated (and I don't think compression is a fix-all here, though it certainly helps).
>> I don’t think it really matters how we count. Text based protocols are here and Erlang ought to provide a great programming environment for them too.
>
> But they're on the way out. You won't find many new text-based protocols, and for good reasons. Even HTTP/2 went binary (and QUIC/HTTP will do the same).
>
> Plain-text is still king for content, but the trend has been toward binaries in recent years. Look at the number of binary serialization formats that popped up. Of course, it will be harder to take over JSON.

Even as an old school telecom protocols guy I’m not sure I really like this move for web based protocols. It’s nice to be able to read a protocol as text - however good one gets at reading hex dumps. HTTP Headers are a minuscule percentage of internet traffic which is dominated by video.

In any case these things tend to have pendulum like properties :)

Sean

>
> Cheers,
>
> --
> Loïc Hoguin
> https://ninenines.eu

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Binary string literal syntax

Loïc Hoguin-3
On 06/07/2018 11:57 AM, Sean Hinde wrote:

>>>> Sure, most people have no clue how to program sockets these days so they use HTTP for everything -- but that isn't *most* protocols, that's a relatively small set of overwhelmingly *prolific* protocols. My prediction is that binary protocols will become more prolific as the extremely limited shared resource of wireless bandwidth becomes more and more saturated (and I don't think compression is a fix-all here, though it certainly helps).
>>> I don’t think it really matters how we count. Text based protocols are here and Erlang ought to provide a great programming environment for them too.
>>
>> But they're on the way out. You won't find many new text-based protocols, and for good reasons. Even HTTP/2 went binary (and QUIC/HTTP will do the same).
>>
>> Plain-text is still king for content, but the trend has been toward binaries in recent years. Look at the number of binary serialization formats that popped up. Of course, it will be harder to take over JSON.
>
> Even as an old school telecom protocols guy I’m not sure I really like this move for web based protocols. It’s nice to be able to read a protocol as text - however good one gets at reading hex dumps. HTTP Headers are a minuscule percentage of internet traffic which is dominated by video.
>
> In any case these things tend to have pendulum like properties :)

Perhaps.

I don't think the "I can read it" argument is a good reason to go one
way or the other though. You can read debug, trace or wireshark output
just fine. And the protocol being text is not necessarily helpful,
for two reasons.

One is invisible characters, which requires you to make sure you can see
them when debugging anyway (using \n instead of \r\n for example).

The other is that on content like JSON, unless your payload is small
you're not going to be able to parse it in your head anyway, so you'll
need tools to make sense of it. Same reason why developer tools are so
useful when writing HTML/CSS.
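
A minimal sketch of the first reason (invisible characters), assuming a hypothetical module name `crlf_demo`: two header lines that print identically on a terminal are still different byte sequences, and formatting with `~p` is one way to make the terminator visible while debugging.

```erlang
-module(crlf_demo).
-export([same_line/0, show/1]).

%% Two header lines that look the same when printed, but differ in
%% their line terminator (CRLF versus a bare LF).
same_line() ->
    A = <<"Host: example.org\r\n">>,
    B = <<"Host: example.org\n">>,
    A =:= B.    %% false: the byte sequences differ

%% Render a binary with ~p so control characters appear as escapes
%% instead of being interpreted by the terminal.
show(Bin) when is_binary(Bin) ->
    lists:flatten(io_lib:format("~p", [Bin])).
```

Calling `crlf_demo:show(<<"a\r\n">>)` yields the text `<<"a\r\n">>`, with the escapes spelled out.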

Ultimately I think a language should be good at both, and I think Erlang
is doing a fine job at parsing both text and binary. The <<"syntax">> is
certainly unfortunate for binary strings, even more so when Unicode is
required. I'm not sure this is something that should be fixed at the
language level though. Some editors put a ) when you write (, perhaps
they should also put >> when you write <<, with or without double
quotes. Perhaps they already do? It's a strategy that's worked well for
more verbose languages like C++ and Java I think.

--
Loïc Hoguin
https://ninenines.eu

Re: Binary string literal syntax

Sean Hinde-4

>> Even as an old school telecom protocols guy I’m not sure I really like this move for web based protocols. It’s nice to be able to read a protocol as text - however good one gets at reading hex dumps. HTTP Headers are a minuscule percentage of internet traffic which is dominated by video.
>> In any case these things tend to have pendulum like properties :)
>
> Perhaps.
>
> I don't think the "I can read it" argument is a good reason to go one way or the other though. You can read debug, trace or wireshark output just fine. And the protocol being text is not necessarily helpful because of two reasons.
>
> One is invisible characters, which requires you to make sure you can see them when debugging anyway (using \n instead of \r\n for example).
>
> The other is that on content like JSON, unless your payload is small you're not going to be able to parse it in your head anyway, so you'll need tools to make sense of it. Same reason why developer tools are so useful when writing HTML/CSS.

I recall the “I can read it” argument was made at the time in part as a rebellion against the "save every bit" mentality in the communications industry standardisation.

Encryption also pretty much kills off wireshark unless you have the keys, so we are only talking about end point debugging.

But sure, it’s not a big reason.

>
> Ultimately I think a language should be good at both, and I think Erlang is doing a fine job at parsing both text and binary. The <<"syntax">> is certainly unfortunate for binary strings, even more so when Unicode is required. I'm not sure this is something that should be fixed at the language level though. Some editors put a ) when you write (, perhaps they should also put >> when you write <<, with or without double quotes. Perhaps they already do? It's a strategy that's worked well for more verbose languages like C++ and Java I think.

Without reader macros, syntax can’t be fixed except at the language level (and no, I’m not arguing we should make another lisp!)

But it does feel that when the time is right we ought to be able to have the nice things with elegant syntax.

BTW The Erlang mode I’m using for VSCode auto adds the closing >> (the one from Pierrick Gourlain, which is really very nice)

Also BTW I’m using REST mode of cowboy for this json based system and it’s really very nice. Thank you!

Sean

>
> --
> Loïc Hoguin
> https://ninenines.eu


Re: Binary string literal syntax

Sean Hinde-4
In reply to this post by zxq9-2
>>
>> The bit syntax was designed for picking apart bit twiddling telecom protocols. It was clearly not designed with the primary goal of representing alternative forms of string literals. It’s just not what you would choose for that application.
>
> The main problem I see with this particular example is that you feel you were dealing with a "string-based protocol" because you were dealing with JSON.
>
> You weren't -- JSON is a list of trees. It is serialized as a string, and strings are used to represent things in JSON that JSON itself is *dramatically* unsuited for, so "everything is a string" seems reasonable to people who don't know anything about type systems or were hustled into pushing a "lipstick on a chicken" prototype into production.
>
> That last case is so common that a lot of new coders haven't ever seen anything *but* JSON in practice. That doesn't mean we should optimize for wrongness.
>
> The point of exacerbation is that you are using a JSON serializer that outputs lists of trees of pairs that contain binary snippets instead of lists as the string representations (Jiffy, I imagine). That isn't the best way to deal with strings in Erlang, imo.

That’s a fair summary of my current state of exacerbation :) And yes, jiffy this week. Binary snippets as keys in json suit the use case well enough so long as the decoded representation can be sanely matched against a string literal. Atoms have their other problems, and strings as lists are just plain annoying for dealing with incoming data that can also take the form of lists of other objects (is_string? - https://stackoverflow.com/questions/2479713/determining-if-an-item-is-a-string-or-a-list-in-erlang?noredirect=1&lq=1)
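
The is_string problem can be sketched as follows. This is only a heuristic, under the assumption that "string" means "list of printable Unicode code points"; the module name `string_or_list` and the result atoms are made up for illustration. `io_lib:printable_unicode_list/1` is the standard library check used here.

```erlang
-module(string_or_list).
-export([classify/1]).

%% A list string is just a list of integers, so the runtime cannot tell
%% text from data. The best available check is a heuristic: treat a
%% list as text when every element is a printable code point.
classify(Term) when is_binary(Term) -> binary;
classify([]) -> empty_list;          %% "" and [] are the same term
classify(Term) when is_list(Term) ->
    case io_lib:printable_unicode_list(Term) of
        true  -> probably_string;
        false -> list
    end;
classify(_) -> other.
```

Note that `classify([104,105])` returns `probably_string` even if the caller meant a list of two numbers, which is exactly the ambiguity that makes decoded binaries easier to match on.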

>
> So we have a conflation of issues here:
> - Strings (or more broadly, io_data()) in Erlang can *actually* represent Unicode types because they can represent things as lexemes not just a flat array of codepoints. That's actually quite advanced.
> - Binaries are just that: binaries. They were indeed never intended for advanced string processing.
> - Binaries *can* represent strings, are more compact in memory and are easier to deal with in NIFs, which is why Jiffy uses them.
> - Jiffy is the most common JSON serializer for Erlang.
>
> Not a single of these issues is addressed or made easier to deal with by a new syntax that equates to <<"foo"/utf8>>. In fact, the /utf8 binary identifier has only been brought up a few times in this thread because it isn't the point.

I tend to believe that syntax is important, not for the “Wah Wah it looks too weird I can’t use that language” reason, but because it defines the UX of the system. And UX does drive behaviour. As much as we would like to think that everyone will think really hard about exactly which representation they need at each point in their program, the current kinds of strings provide a bit of a Hobson’s choice.

> What you *really* want, I think, is this:
> 1. A concrete decision about how Erlang represents UTF-8 in memory. A canonicalization.
> 2. A single io_data() -> utf8_string() IMPORT function.
> 3. Access to the canonical representation so that dealing with it in Rust/C NIFs and Erlang is not mind bending.
> 4. A single utf8_string() -> io_data() EXPORT function that has a default serialization rule.
> 5. A set of functions that allow me to pick which binary representation is output if the default is unsuitable (like when I really need to cast hangul characters to their equivalent broken-down lexemes, for example).
> 6. A special syntax that abstracts the concept of the underlying representation for utf8 in memory.

If we can have all that without overhead of having to parse byte by byte all incoming data to be sure it’s valid utf8 (utf8_raw mode?) that looks like an excellent way forward.

> None of these are trivial issues or should be messed about with lightly.

Agreed. The EEP process and culture of this community is well designed to weed out badly thought through proposals :)

>
> As for syntax, quoting we have so far for the types we have so far is great. The <<"blahblah">> thing for direct access to binaries is great. The "foo" == [$f, $o, $o] sugar is also brilliant. The fact that io_data() is a nested list of stuff can very often make complex, large manipulation of io_data() way faster in Erlang than other languages that have to traverse binary strings to do their work, even if it looks ugly (but again, remembering that the *data* you're dealing with is trees merely represented by strings is key).
>
> So I think Erlang has really gotten all of that right.
>
> But we still SHOULD eventually have a canonical utf8 type.
>
> As for syntax...
>
> I HATE prefix-glyph syntax for quotes. Ugh. Better to just give me a single-letter function name and let me do u("blah") or whatever. Then I don't have to learn anything new, at least, and can use it in a list function or whatever.
>
> I DOUBLY HATE it when new programmers get confused by prefix-glyph syntax. You don't have to teach anyone what a normal-looking quote mark is or how to use or type them.
>
> So if we have to have a special syntax, instead, I would recommend backticks-as-quotes.
>
> 'an_atom'
> "a listy string"
> <<"a binary string">>
> `a canonical utf8 string`

Go-lang made the backtick choice: https://golang.org/ref/spec#String_literals with some interesting semantics. \r chars are stripped out, and their backtick strings can span multiple lines.

I’ll do some digging to see how happy their community is with the choice.

My only concern would be how easy it could be to mistake ` for ‘ when reading code.

> We have a million other kinds of quotes in Japanese that would 「suit」『me』【just】《fine》 but totally screw everyone else over, sort of like german quote angle thingies would were they to be made mandatory -- but I think backticks are universally available without any special input modes (correct me if I'm wrong).

At least for programmers it ought to be available - shells have used it forever.

> The `utf8 string` version would be a strict, canonical equivalent to <<"utf8 string"/utf8>> in memory. I'm actually not sure whether the current binary /utf8 tag forces canonicalization (or if it does, *which* unicode form is canonical in Erlang right now). The canonical representation in memory issue has to be ironed out if you want your JSON situation to improve -- and for you I think this is really the rub (whereas I have very different concerns with unicode strings, and would be a bit annoyed if an optimization in the interest of JSON made dealing with things like client-side input or string forms commonly embedded in binary protocol traffic out on my half of the planet unduly complicated).

JSON handling ought not of course to be the determinant, it’s just this week’s random thing I happen to be working on. I wasn’t working on it last week when the thought came about whether we could steal some ideas from Elixir for string handling.
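
On the canonicalization question raised in the quoted paragraph: to the best of my knowledge the `/utf8` segment type only encodes each code point as UTF-8 and applies no normalization. A minimal sketch, with the hypothetical module name `nfc_demo`, compares a precomposed and a decomposed "é" before and after `unicode:characters_to_nfc_binary/1` (available since OTP 20):

```erlang
-module(nfc_demo).
-export([raw_equal/0, normalized_equal/0]).

%% "é" as a single precomposed code point (U+00E9) ...
precomposed() -> <<16#E9/utf8>>.
%% ... and as "e" followed by a combining acute accent (U+0301).
decomposed() -> <<$e/utf8, 16#301/utf8>>.

%% The /utf8 tag encodes code points as given, so the bytes differ.
raw_equal() ->
    precomposed() =:= decomposed().

%% After explicit normalization to NFC both forms compare equal.
normalized_equal() ->
    unicode:characters_to_nfc_binary(precomposed()) =:=
        unicode:characters_to_nfc_binary(decomposed()).
```

Here `raw_equal()` is `false` and `normalized_equal()` is `true`, so any canonical text type would have to pick a normalization form explicitly.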

> As far as what is happening in Erlang right now to clear some of these issues up, since R19 a LOT of unicode changes have been happening, and most of them are really headed in happy directions. I would say that we need to keep this in the back of our minds, but that implementing anything like unicode canonicalization (to the point that we are happy with whatever is decided forever and ever come-come-what-may-and-screw-the-corner-cases, amen) and especially implementing any special syntax to abstract it in code is premature.

That’s *just* a matter of release planning :)

> Dan Gudmundsson has done a TON of excellent work in this area and continues to do so. He has gained a huge amount of knowledge and experience about unicode and how it interacts with current representations, and he would really be the one to ask about what "should be done" and where we are in terms of reaching a unicode string type that makes sense to deal with internally, in NIFs, exported data, etc.

Darn, I should have grabbed Dan on this topic at Code BEAM STO last week!

> What am I missing?

Good question. It prompted a few more thoughts:

It would be nice to be able to use a new string format in more of the places we use strings. So some kind of interpolation for string construction: io_lib:format(`~p`, [atom]). to follow existing conventions, or something more modern:

`some \(fun() -> "interpolated" end) string` - Swift
or
`some #{"interpolated"} string` - Elixir

Though without “printable” protocols I guess these last two wouldn’t fly
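
For comparison, what is possible today without new syntax can be sketched like this (the module and function names are illustrative only): `io_lib:format/2` with the `~ts` directive accepts both list strings and UTF-8 binaries, and `unicode:characters_to_binary/1` turns the result into a binary.

```erlang
-module(interp_demo).
-export([greeting/1]).

%% Build a UTF-8 binary from a template. ~ts accepts Unicode list
%% strings as well as UTF-8 encoded binaries.
greeting(Name) ->
    Chars = io_lib:format("some ~ts string", [Name]),
    unicode:characters_to_binary(Chars).
```

Both `interp_demo:greeting("interpolated")` and `interp_demo:greeting(<<"interpolated">>)` return `<<"some interpolated string">>`.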

Thanks for adding your well thought out ideas and views.

Sean


Re: Binary string literal syntax

Jesper Louis Andersen-2
In reply to this post by Sean Hinde-4
On Tue, Jun 5, 2018 at 10:57 PM Sean Hinde <[hidden email]> wrote:
My proposal would be to add an alternative notation for binary string literals in Erlang along the lines of:

~s"Some binary string" mapping to <<"Some binary string">>


The underlying problem is that Erlang is chromodynamic, for a lack of better term[0]. In a chromodynamic language, there is one type, term(), but data of that type has "color" insofar data is used with different intent:

* ISO8859-15 strings
* UTF-8 strings
* Lists of integers, where each integer is a code point
* binary() payloads
* binary() data which has interpretation
* bitstring()
* integers used as sets of bits

And so on. Data is then mapped onto a given subset of term(), namely string(), [non_neg_integer()], [0..255], binary(), iolist(), iodata() etc.

Colors don't mix. We can't have green UTF-8 strings together with blue binary() data. But the onus of keeping the colors apart is on the programmer, not on the system.

Typed languages (that is the nontrivially typed ones) keep data apart by means of a type system. So there, we can't mix a UTF-8 string with a binary() blob unless we explicitly convert between the types. However, in a chromodynamic language, we need another way to identify the colors, and this leads into the need for explicit syntactic notation to tell them apart.

Worse, our mapping of colorful data to term() is forgetful (or if I may: the mapping is desaturating). So once we have the underlying term(), we don't know from where it came.
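
A small sketch of this "desaturation", using a hypothetical module `color_demo`: once data is a term, the intent behind it is gone, both for lists and for binaries.

```erlang
-module(color_demo).
-export([same_term/0, utf8_vs_latin1/0]).

%% A piece of text and a list of numbers are the very same term.
same_term() ->
    Text   = "hi",
    Scores = [104, 105],
    Text =:= Scores.

%% "é" encoded as UTF-8 and as Latin-1: both are plain binaries, and
%% nothing in the term records which encoding was intended.
utf8_vs_latin1() ->
    Utf8   = <<16#E9/utf8>>,   %% two bytes
    Latin1 = <<16#E9>>,        %% one byte
    {is_binary(Utf8), is_binary(Latin1),
     byte_size(Utf8), byte_size(Latin1)}.
```

`same_term()` returns `true`, and `utf8_vs_latin1()` returns `{true,true,2,1}`.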

History plays an important role of course. binary() was intended for binary() data which are just vectors of bytes. But over time, they've found other uses in Erlang systems:

* strings() - mostly due to better packing of data. Especially on 64bit machines where list cons cells have considerable overhead.
* utf8 encoded strings
* dynamic atoms (because Richard O'Keefe's "Split the Atoms" proposal was never implemented). You can run out of atoms, but you cannot run out of binary() if you pay the price of more expensive equality checks.
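
One common way to live with the atom limit, sketched with the hypothetical module name `key_demo`: convert an incoming key to an atom only when that atom already exists, and otherwise keep the binary.

```erlang
-module(key_demo).
-export([to_key/1]).

%% Never create new atoms from external input. An already known atom
%% is reused; unknown keys stay binaries and cannot fill the atom table.
to_key(Bin) when is_binary(Bin) ->
    try
        binary_to_existing_atom(Bin, utf8)
    catch
        error:badarg -> Bin
    end.
```

With this, `key_demo:to_key(<<"ok">>)` returns the atom `ok`, while a key that names no existing atom comes back unchanged as a binary.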

Given their prominence, I think it would be good to open a discussion on a more succinct syntax for binary() data. Perhaps laced with a discussion about what utf8 strings should be in the system. Over the years, the ubiquity of binary() data has just slowly grown.

Were I the BDFL, I'd probably go with the following scheme:

string() - still a list of code points. The double quote is used: "example"
binary() - still written as <<Binary>>
atom() - still there, used when you need fast equality checks. I'd probably try to figure out how to GC them so they don't have the current limitation, which would open up their use for more cases where we currently use a binary()
text() - A new type specifically for contiguous pieces of unicode. Always encoded as UTF-8. They will have a new syntax, probably `example`. Or perhaps #"example" or ~"example". The latter two have the advantage that they can generalize: ~r".*" etc, but the former is awfully readable.

This introduces an honest-to-god type for textual data where the data is processed as a whole, and it would probably accept a fully backwards compatible representation. We need to discriminate between binary() and textual data at the lowest level anyway. Otherwise you run the risk of mixing color way too often. Conversion routines should verify and crash on conversions which are not allowed.

Rationale: I'd never create the string() type were I to create a new language. A string is not defined as a list of codepoints, but rather as a vector of codepoints (which also means they are immutable). They should support O(1) concatenation (by having the internal representation be either iodata() or a finger tree). But since we have so much legacy code, we are stuck with string(), much like Haskell is where String = [Char].
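
The iodata() flavour of cheap concatenation can be sketched as below (module and function names are illustrative): concatenating is just building a new list cell around the existing parts, and the bytes are copied once, at the point where a flat binary is actually needed.

```erlang
-module(iodata_demo).
-export([wrap/1, flatten/1]).

%% Concatenation by nesting: the cost does not depend on the size of
%% Body, because nothing inside Body is copied or traversed here.
wrap(Body) ->
    [<<"<p>">>, Body, <<"</p>">>].

%% Flatten only when a contiguous binary is really required. Ports and
%% sockets accept iodata directly, so this step is often unnecessary.
flatten(IoData) ->
    iolist_to_binary(IoData).
```

For example, `iodata_demo:flatten(iodata_demo:wrap("hi"))` returns `<<"<p>hi</p>">>`.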

End of BDFL rant :)


[0] In keeping with CS tradition, I'll take a term from physics and absolutely butcher it by using it in a different context where it doesn't belong. Bear with me, for I have sinned.


Re: Binary string literal syntax

Lloyd R. Prentice-2
In reply to this post by zxq9-2
> Saying "unicode is the standard now, and UTF-8 is The One True Way" is also saying
> "a fantastically complex world of codepoints and construction indicators that allow
> for multiple representations as equivalent is now the standard". That doesn't do
> anything to solve the question of whether there should be a separate string type. It
> also ignores that Windows is natively UTF-16, not UTF-8 (though it works a lot better
> with UTF-8 these days).

Hi Craig,

I’d like to say “right on!,” but I probably shouldn’t participate in this debate.

For one, I’m not a professional programmer. I’ve only, painfully, worked hard to learn Erlang to solve a very specific problem.

So I bring a pragmatic beginner’s mind to this discussion that all are free to discount. And, as an English speaker, I bring an ashamedly provincial bias. Indeed, after seventy some odd years I still struggle to express myself fluently in English.

My pain point is this: I cringe now every time I want to use an Erlang string function. Since my aging memory is not now what it once was, I need to consult the reference manual frequently while I’m programming. And, I must admit, that the new string functions baffle, frustrate and, unreasonably, enrage me. I totally lose flow and concentration when I need to do what was once the simplest string operation, spending many many precious minutes trying to understand the new and improved way of going about it.

So, just a few observations for what they’re worth:

Seems to me that trying to find the one universal digital standard for representing all the wonderful organic and evolving natural languages in the world is an exercise in hubris.

There is no surer road to complexity and bloat than trying to be all things to all people.

But, yes, in this global world, we do need to communicate across natural language domains.

Esperanto is one not terribly successful attempt to do this in the non-digital world. How many among us speak Esperanto?

Yet, we use translators and translation services quite effectively.

So, perhaps, the glyphs of each language should have their own most efficient and standardized digital representation. And serious intellectual capital should go into writing language-to-language translation packages.

All the best to all,

LRP

Sent from my iPad

> On Jun 6, 2018, at 6:48 PM, [hidden email] wrote:
>
>> On 2018年6月6日水曜日 14時00分20秒 JST Vlad Dumitrescu wrote:
>>
>> - The new string functions work with strings as sequences of lexemes. The
>> "list strings" are lists of characters, so for example calling length() on
>> the two representations of the same text may not return the same value.
>> Most notably, CRLF is a lexeme, but two characters.
>
> To expand on this point (as I've done before) many lexemes used in CJK have multiple constructions that are considered equivalent. Korean hangul is almost a pure example of this, as input is typically done over aggregate lexemes and often re-masked as a single codepoint once the input phase is complete, but not always. Pinyin input works in a similar way but has way more complex aggregate lexemes, though the principle is similar. Even Japanese has a few examples of this (is ぷ a single character or is it [[ふ,゜]] this particular time we encounter it?).
>
> etc.
>
> Saying "unicode is the standard now, and UTF-8 is The One True Way" is also saying "a fantastically complex world of codepoints and construction indicators that allow for multiple representations as equivalent is now the standard". That doesn't do anything to solve the question of whether there should be a separate string type. It also ignores that Windows is natively UTF-16, not UTF-8 (though it works a lot better with UTF-8 these days).
>
> Go read the unicode standard. It's... well, just have fun reading it. I don't know anyone who understands *all* of it 100% -- because for most people the first 10% or so "works for me" (for mainstream CJK use probably the first 30% or so).
>
> I suppose all of this is to say that when you compare the enormous number of corner cases in string handling against MERE SYNTAX the syntax is such a trivial issue that it isn't even worth time thinking about. The reason this subject comes up once a year is because we all wish there were a magical set of characters we would write into source files that mean "just make this string work, regardless what it is, because this is hard". We want a syntax that represents the system abstracting away from the underlying data and instead forcing it to mean what *we* mean -- which is insufficiently low-level for much of the work we do as programmers (so we're leaving a lot undefined there, which is bad). At what level do you need to interpret a string? The answer is different for different programs (someone writing a Korean input interpreter is in a very different case than someone writing a chat server). This desire is a kissing cousin of the grand desire for a unified method of l10n and i18n -- which turns out to be just as hard to get right as string handling because every case is different, often in a way that conflicts with another case you have to handle.
>
> Blah blah. Strings are hard, and about half of what we think of as "strings" are really serializations of non-string data anyway.
>
> -Craig

Re: Binary string literal syntax

Sean Hinde-4
In reply to this post by Jesper Louis Andersen-2
Seems like a good time for a summary. Yell if I have misunderstood or misrepresented anything.

 - There is little appetite for a new 'bandaid' syntax mapping to <<"utf8-string"/utf8>>
 - Erlang is already on a path towards much improved string handling (although at the cost of making the simple cases more complex?)
 - No-one is arguing we have the ideal solution for strings today
 - There is appetite for a new string type. This should have a distinct dynamic type - text() ?
 - This type ought to be represented internally in a carefully chosen canonical utf-8 format
 - It needs clean and efficient mechanisms for input and output to file / network / nif with utf-8 default but option for many other representations.
 - Backtick quotes have a few votes as new syntax for such a representation. Go-lang has chosen backtick for their raw strings with some interesting semantics.
 - A new text() type would allow io to print these strings as `utf8 string` rather than falling back to binary representation.
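
The printing point in the last item can be sketched with today's format directives (hypothetical module `print_demo`): `~p` shows a binary as an Erlang term, while `~ts` interprets it as UTF-8 text. A text() type would, in effect, let io make this choice without the caller spelling it out.

```erlang
-module(print_demo).
-export([as_term/1, as_text/1]).

%% ~p prints the binary as a term, with the <<...>> wrapper.
as_term(Bin) when is_binary(Bin) ->
    lists:flatten(io_lib:format("~p", [Bin])).

%% ~ts treats the binary as UTF-8 encoded text and yields the
%% characters themselves as a list of code points.
as_text(Bin) when is_binary(Bin) ->
    lists:flatten(io_lib:format("~ts", [Bin])).
```

For `<<"hello">>`, `as_term/1` gives `"<<\"hello\">>"` and `as_text/1` gives `"hello"`.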

Some issues:

 - Adding a new type raises all the usual questions about equality, ordering, conversion (implicit and via a many to many matrix of conversion functions), guards etc. It’s a much larger change than a simple syntax representation mapping to <<"some string"/utf8>>
 - How do we concatenate them? `Hello` <> `World`?
 - How do we construct them? io_lib:format(`~p`, [atom]). `hello #{Name}` ?
 - How do we incorporate them in other string types? io:format("~s", [`text`]). <<"txt", `utf8 text`, 0>>
 - How do we extract them from binary data? <<T:4/text, Rest/binary>>. What is the meaning of the length parameter? The string module already has a clear definition.
 - What does matching a literal out of binary data mean? <<`Hêllö`, Rest/text>> == <<"Hêllö World"/utf8>>
 - Prefix matching in normal code? `Hello` <> World = `Hello world` 
 - Is there already a suitable internal canonical utf-8 format? OTP team?
 - Lots of other details in the mails from everyone
 - Everything else
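
The "meaning of the length parameter" question above has at least three candidate answers. A minimal sketch, assuming a hypothetical module `len_demo`, measures the same UTF-8 binary in bytes, in code points, and in grapheme clusters (the unit that `string:length/1` counts):

```erlang
-module(len_demo).
-export([measure/1]).

%% Three different "lengths" of the same UTF-8 encoded text.
measure(Bin) when is_binary(Bin) ->
    #{bytes       => byte_size(Bin),
      code_points => length(unicode:characters_to_list(Bin)),
      graphemes   => string:length(Bin)}.
```

For `<<"\r\n">>` this gives 2 bytes, 2 code points and 1 grapheme cluster; for "e" followed by a combining acute accent (`<<$e/utf8, 16#301/utf8>>`) it gives 3 bytes, 2 code points and 1 grapheme cluster. A `T:4/text` segment would have to state which of these units the 4 refers to.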

Sigils are orthogonal to this discussion (and one of the secondary benefits is pretty nicely realised by using backtick - the more common double quotes would not need to be escaped - yes, JSON).

So, the big question: Did this already reach a level of complexity the language will sink under?

Is it worth spending time fleshing any of this out to an EEP level of detail?

Do we come back in another year?

Would an EEP help the existing work of the OTP team in this area or is there already a clear plan and this would be a distraction?

Sean


Aside: 
We don’t have the BDFL question. Instead, Erlang/OTP has a process.

One old boss I respected explained a reason big companies like to buy from big companies even at many times the price - his reason was that small companies rely on people and big companies on process. The process will deliver (eventually) even if the people change three times in the middle!





On 7 Jun 2018, at 14:29, Jesper Louis Andersen <[hidden email]> wrote:

On Tue, Jun 5, 2018 at 10:57 PM Sean Hinde <[hidden email]> wrote:
My proposal would be to add an alternative notation for binary string literals in Erlang along the lines of:

~s”Some binary string” mapping to <<"Some binary string”>>


The underlying problem is that Erlang is chromodynamic, for a lack of better term[0]. In a chromodynamic language, there is one type, term(), but data of that type has "color" insofar data is used with different intent:

* ISO8859-15 strings
* UTF-8 strings
* Lists of integers, where each integer is a code point
* binary() payloads
* binary() data which has interpretation
* bitstring()
* integers used as sets of bits

And so on. Data is then mapped onto a given subset of term(), namely string(), [non_neg_integer()], [0..255], binary(), iolist(), iodata() etc.

Colors don't mix. We can't have green UTF-8 strings together with blue binary() data. But the onus of keeping the colors apart is on the programmer, not on the system.

Typed languages (that is the nontrivially typed ones) keeps data apart by means of a type system. So there, we can't mix a UTF-8 string with a binary() blob unless we explicitly convert between the types. However, in a chromodynamic language, we need another way to identify the colors, and this leads into the need for explicit syntactic notation to tell them apart.

Worse, our mapping of colorful data to term() is forgetful (or if I may: the mapping is desaturating). So once we have the underlying term(), we don't know from where it came.

History plays an important role of course. binary() was intended for binary() data which are just vectors of bytes. But over time, they've found other uses in Erlang systems:

* strings() - mostly due to better packing of data. Especially on 64bit machines where list cons cells have considerable overhead.
* utf8 encoded strings
* dynamic atoms (because Richard O'Keefe's "Split the Atoms proposal was never implemented). You can run out of atoms, but you cannot run out of binary() if you pay the price of more expensive equality checks.

Given their prominence, I think it would be good to open a discussion on a more succinct syntax for binary() data. Perhaps laced with a discussion about what utf8 strings should be in the system. Over the years, the ubiquity of binary() data has just slowly grown.

Were I the BDFL, I'd probably go with the following scheme:

string() - still a list of code points. The double quote is used: "example"
binary() - still written as <<Binary>>
atom() - still there, used when you need fast equality checks. I'd probably try to figure out how to GC them so they don't have the current limitation, which would open up their use for more cases where we currently use a binary()
text() - A new type specifically for contiguous pieces of Unicode. Always encoded as UTF-8. They will have a new syntax, probably `example`. Or perhaps #"example" or ~"example". The latter two have the advantage that they can generalize: ~r".*" etc., but the former is awfully readable.

This introduces an honest-to-god type for textual data where the data is processed as a whole, and it would probably accept a fully backwards compatible representation. We need to discriminate between binary() and textual data at the lowest level anyway. Otherwise you run the risk of mixing colors way too often. Conversion routines should verify and crash on conversions which are not allowed.
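As an illustration only, the "verify and crash" conversion routines could look roughly like this in today's Erlang. The {text, Bin} tuple is a hypothetical stand-in for the proposed text() type, and the module and function names are invented; only unicode:characters_to_binary/1,3 is an existing OTP call.

```erlang
-module(text_sketch).
-export([from_binary/1, from_string/1, to_binary/1]).

%% Hypothetical tagged tuple standing in for the proposed text() type.

%% Accept a binary only if it is valid UTF-8; crash otherwise.
from_binary(Bin) when is_binary(Bin) ->
    %% Re-encoding UTF-8 to UTF-8 returns the same bytes for valid input
    %% and an {error, ...} or {incomplete, ...} tuple for invalid input.
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Bin -> {text, Bin};
        _Invalid -> error(badarg)
    end.

%% Accept a list of code points (or chardata) and encode it as UTF-8.
from_string(Str) when is_list(Str) ->
    case unicode:characters_to_binary(Str) of
        Bin when is_binary(Bin) -> {text, Bin};
        _Invalid -> error(badarg)
    end.

%% Going back to binary() is always allowed: UTF-8 text is also bytes.
to_binary({text, Bin}) -> Bin.
```

For example, text_sketch:from_binary(<<255>>) raises badarg because a lone 16#FF byte is not valid UTF-8, whereas text_sketch:from_string([16#E9]) yields {text, <<195,169>>}. A real type would enforce this in the runtime rather than by convention.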

Rationale: I'd never create the string() type were I to create a new language. A string is not defined as a list of codepoints, but rather as a vector of codepoints (which also means they are immutable). They should support O(1) concatenation (by having the internal representation be either iodata() or a finger tree). But since we have so much legacy code, we are stuck with string(), much like Haskell is where String = [Char].
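The O(1) concatenation mentioned here is what iodata() already gives you for byte data. A minimal sketch, with hypothetical module and function names:

```erlang
-module(iodata_sketch).
-export([concat/2, flatten/1, demo/0]).

%% Constant-time concatenation: wrap both parts in a new list.
%% Neither argument is traversed or copied.
concat(A, B) -> [A, B].

%% Pay the linear cost once, when a flat binary is really needed.
flatten(IoData) -> iolist_to_binary(IoData).

demo() ->
    Greeting = concat(<<"Hello, ">>, "world"),
    Line = concat(Greeting, <<"!\n">>),
    <<"Hello, world!\n">> = flatten(Line),
    ok.
```

Note that iolist_to_binary/1 only accepts bytes (0..255) as list elements; for lists of arbitrary code points, unicode:characters_to_binary/1 would be the analogous flattening step.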

End of BDFL rant :)


[0] In keeping with CS tradition, I'll take a term from physics and absolutely butcher it by using it in a different context where it doesn't belong. Bear with me, for I have sinned.



Re: Binary string literal syntax

Lukas Larsson-8
Hello,

Great discussion and ideas here!

One thing that I've not seen mentioned is: what if the list representation were made more memory efficient? Today it's 16 bytes per codepoint vs. binaries, which are 1-4 bytes per codepoint. What if lists only used 8 bytes for each codepoint? What if they used the same as binaries? How would that change this discussion?
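The numbers quoted above can be reproduced with a small sketch (module and function names are hypothetical). It compares the UTF-8 width of a few code points with the size of one cons cell, which is two machine words:

```erlang
-module(codepoint_cost_sketch).
-export([utf8_bytes/1, cons_bytes/0, demo/0]).

%% Number of bytes a single code point occupies when UTF-8 encoded.
utf8_bytes(CodePoint) -> byte_size(<<CodePoint/utf8>>).

%% One list cell is two machine words (head and tail).
cons_bytes() -> 2 * erlang:system_info(wordsize).

%% {CodePoint, BytesInUtf8Binary, BytesInList} for a few sample characters:
%% ASCII 'a', U+00E9, U+20AC and U+1F600.
demo() ->
    [{CP, utf8_bytes(CP), cons_bytes()}
     || CP <- [$a, 16#E9, 16#20AC, 16#1F600]].
```

The second element of each tuple ranges from 1 to 4, while the third is constant at two words per code point (16 bytes on a 64-bit emulator).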

On Mon, Jun 11, 2018 at 10:22 AM, Sean Hinde <[hidden email]> wrote:
> Would an EEP help the existing work of the OTP team in this area or is there already a clear plan and this would be a distraction?


There is no plan about what should be done in this area. We want to continue developing the possibility to encode and decode protocols. We've had numerous discussions about how we would like to extend the binary syntax (or the syntax in general) in order to make it better for both novice and advanced users of Erlang, but have yet to come up with something that we like. So far our discussions have been mostly about decoding protocols, because we see that as the larger pain point, but maybe we were wrong about that?

Regarding creating a new text type, I'm personally skeptical, but haven't formed a strong opinion on the matter yet. Adding a new type is a huge undertaking and we should be very sure about what we want before doing it.

Lukas


Re: Binary string literal syntax

Tristan Sloughter-4
That would be great!

Would there be much reason at all to use binary for text if this were the case now that utf is also supported? I suppose it would still be optimal if one is passing around large chunks of >64 bytes of text, but besides that are there any performance reasons to use binaries over lists assuming the memory usage were the same?

Tristan


On Mon, Jun 11, 2018, at 10:47 AM, Lukas Larsson wrote:
> Hello,
>
> Great discussion and ideas here!
>
> One thing that I've not seen mentioned is: what if the list representation were made more memory efficient? Today it's 16 bytes per codepoint vs. binaries, which are 1-4 bytes per codepoint. What if lists only used 8 bytes for each codepoint? What if they used the same as binaries? How would that change this discussion?
>
> On Mon, Jun 11, 2018 at 10:22 AM, Sean Hinde <[hidden email]> wrote:
>> Would an EEP help the existing work of the OTP team in this area or is there already a clear plan and this would be a distraction?
>
> There is no plan about what should be done in this area. We want to continue developing the possibility to encode and decode protocols. We've had numerous discussions about how we would like to extend the binary syntax (or the syntax in general) in order to make it better for both novice and advanced users of Erlang, but have yet to come up with something that we like. So far our discussions have been mostly about decoding protocols, because we see that as the larger pain point, but maybe we were wrong about that?
>
> Regarding creating a new text type, I'm personally skeptical, but haven't formed a strong opinion on the matter yet. Adding a new type is a huge undertaking and we should be very sure about what we want before doing it.
>
> Lukas


Re: Binary string literal syntax

Lukas Larsson-8


On Tue, Jun 12, 2018 at 5:56 PM Tristan Sloughter <[hidden email]> wrote:
> That would be great!
>
> Would there be much reason at all to use binary for text if this were the case now that utf is also supported?

It may still be better to use binaries as text if you don't do much processing on the Erlang side. Since the list would contain char() entries, there is still an encoding cost to convert to utf-whatever that you may want to avoid. Also, you will read the data from somewhere which is probably utf-something encoded, so at least when doing the initial parsing you will deal with binaries.

> I suppose it would still be optimal if one is passing around large chunks of >64 bytes of text, but besides that are there any performance reasons to use binaries over lists assuming the memory usage were the same?

I'm not sure.... One of the things that binaries are good at is matching out sub binaries, i.e. taking <<"foo">> out of <<"foobar">> without having to copy <<"foo">>. In order to do the same with lists a new syntax would have to be added, something like [SubList:24/list | T] and a lot of support in the run-time. Today there are 4 different types of binaries in the run-time, while only 1 list type. If we go down this route we'll end up with 3-4 list types as well, which of course adds to complexity in a lot of places.
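For reference, the sub-binary matching described above looks like this with the existing bit syntax. This is a small sketch with made-up module and function names; the list version is included for contrast, since lists:split/2 has to build a new list for the prefix.

```erlang
-module(subbin_sketch).
-export([split_prefix/2, demo/0]).

%% Match the first N bytes out of a binary. The results are expected to
%% be sub-binaries referring to the original data rather than copies.
split_prefix(Bin, N) when is_binary(Bin), is_integer(N), N >= 0 ->
    <<Prefix:N/binary, Rest/binary>> = Bin,
    {Prefix, Rest}.

demo() ->
    {<<"foo">>, <<"bar">>} = split_prefix(<<"foobar">>, 3),
    %% List counterpart: the first three cells are rebuilt as a new list,
    %% only the tail is shared with the original.
    {"foo", "bar"} = lists:split(3, "foobar"),
    ok.
```

There is no list pattern today that corresponds to Prefix:N/binary, which is why the hypothetical [SubList:24/list | T] syntax would need new run-time support.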

I'm fairly confident that we can get the cost of lists down to 8 bytes per cons cell, but it will require re-writing a lot of code all over the place. Getting it down even further will be even more difficult.... but not impossible I think. 
