Binary string literal syntax

Sean Hinde-4
I’ve been writing a lot of Elixir over the last few months (plus Swift, Java, C) and just came back to Erlang. There are a few things I’ve come to very much like about Elixir I think might be quite useful to bring to Erlang.

The first, and topic of this email, is arguably trivial, but having to surround <<"modern">> string literals with <<>> is irritating and not getting any less so.

It occurred to me we might usefully steal the sigil notation from Elixir to solve this and add other string literal related syntactic niceties.

My proposal would be to add an alternative notation for binary string literals in Erlang along the lines of:

~s"Some binary string" mapping to <<"Some binary string">>

This would open the door to other alternative notations. e.g. Elixir has one for regex friendly strings:

~r/foo|bar/

String literals containing quotes could also benefit from different delimiter characters e.g.

~s'{"Name":"Tom","Age":10}' mapping to <<"{\"Name\":\"Tom\",\"Age\":10}">>
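For reference, this is what the two proposed literals would stand for in the Erlang of this thread. It is only a sketch: the module and function names are made up for illustration, and the ~s forms above are the proposal, not something the compiler accepts here.

```erlang
-module(binlit_demo).
-export([plain/0, json/0]).

%% What ~s"Some binary string" would stand for.
plain() ->
    <<"Some binary string">>.

%% What ~s'{"Name":"Tom","Age":10}' would stand for. With the current
%% syntax every inner double quote has to be escaped.
json() ->
    <<"{\"Name\":\"Tom\",\"Age\":10}">>.
```

Both functions return ordinary binaries; the only thing the proposal changes is how the literal is typed.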

The Elixir docs cover more options: https://elixir-lang.org/getting-started/sigils.html. There are 8 alternative delimiters, and even user-defined sigils and interpolation with #{}.

If we didn’t want to go as far as Elixir style sigils we could just use a single prefix char.

LFE uses:

#"Some binary string"

or maybe

~"Some binary string"

If people here on the list think there is anything useful in this, and we can reach some kind of rough consensus on how far it makes sense to go I will put some effort into an EEP.

Sean

 - first post since July 2008. It’s good to be back
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Binary string literal syntax

zxq9-2
On Tuesday, 5 June 2018 19:51:32 JST Sean Hinde wrote:
> I’ve been writing a lot of Elixir over the last few months (plus Swift, Java, C) and just came back to Erlang. There are a few things I’ve come to very much like about Elixir I think might be quite useful to bring to Erlang.
>
> The first, and topic of this email, is arguably trivial, but having to surround <<"modern">> string literals with <<>> is irritating and not getting any less so.

I like ideas such as this but believe the core language is not the place to implement them.

Syntax is like ocean plastic -- it lingers forever, works its way into unexpected places and interacts with the metabolism in surprising ways. Adding syntax to an existing language, especially syntax that represents something one can already represent, is *dangerous*. It makes the overall language less useful and each addition dramatically increases the cognitive overhead of learning the language.

Typing two different characters two times each, as in <<>> does not exhibit much advantage in keystrokiness over typing a widely gapped sequence ~s''. Magical quote interpretation, as in Python, to avoid the need to escape one or the other kinds of quote literals within a string is handy, but doesn't change much about the difficulty of the language. As Erlang already makes a distinction between single and double quotes, though, diddling with this is a dangerous change (adding syntax) in the interest of glyphy familiarity.

Erlang is a small language. A large part of its utility stems from the fact that the core language is TINY and not littered with a ton of conveniences. This should be strongly protected. Compare the universal readability of C vs the wetware-level incompatibility among various styles of C++ (or Haskell, for that matter); that's astonishing and sad about C++.

If Elixir does what you like, then use it. There is a strong chance that Elixir will explore so much in terms of language conveniences that it will blow up the way C++, Ruby, and Perl have. That's fine, because we'll learn a lot and Elixir's grandchildren will really be something! Introducing Elixirisms into Erlang, a stick-in-the-mud, traditional, old, venerable language with a very low learning curve at the level of language complexity, is purely a risk.

I would argue that there are *many* places Erlang could benefit from changes, but that inside Erlang is not the place to make them.

I'd like to create a "nearly Erlang" language -- 99% "readable as Erlang" but not actually Erlang -- to advance changes like the one you propose. It could compile on its own or simply rewrite to Erlang proper.

- Semantic whitespace (no more ant-poop or weirdness with terminators before 'end' or 'after' or 'catch')
- Syntactic representation of pipeline checks (or a stdlib pipeline function) to avoid a string of checks forcing newbies into writing highly nested 'case'
- Removal of 'if'
- A reworking of quote representations to make UTF-8 atoms, strings, binary strings, etc. a bit less noisy (maybe the approach you propose, actually, but I would require a lot of time to think about it to get it right).
- A '-pure' declaration (but this could be implemented already in Dialyzer, but just isn't)
- etc.
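The "stdlib pipeline function" bullet can already be approximated in plain Erlang. The following is only a sketch of one possible shape: pipe/2 and the helper names are hypothetical, and it assumes every step returns {ok, Value} or {error, Reason}.

```erlang
-module(pipeline_demo).
-export([pipe/2, parse_age/1]).

%% Run each fun in order, feeding it the previous result.
%% Stops at the first {error, Reason}.
pipe(Value, []) ->
    {ok, Value};
pipe(Value, [F | Rest]) ->
    case F(Value) of
        {ok, Next} -> pipe(Next, Rest);
        {error, _} = Error -> Error
    end.

%% Example: a chain of checks without nested 'case' expressions.
parse_age(Bin) ->
    pipe(Bin, [fun to_integer/1, fun non_negative/1]).

to_integer(Bin) ->
    try {ok, binary_to_integer(Bin)}
    catch error:badarg -> {error, not_an_integer}
    end.

non_negative(N) when N >= 0 -> {ok, N};
non_negative(_) -> {error, negative}.
```

With this, pipeline_demo:parse_age(<<"10">>) gives {ok,10}, while a non-numeric or negative input stops the chain at the step that rejected it.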

The place to do these changes is not in Erlang, but in another language that compiles to Erlang or to BEAM.

I'm an Erlang conservationist. Don't corrupt the pristine Erlangnessness with your stays-around-forever-like-marine-plastic syntactic litter!

-Craig

Re: Binary string literal syntax

José Valim-2
> There is a strong chance that Elixir will explore so much in terms of language conveniences that it will blow up the way C++, Ruby, and Perl have.

Elixir has not added new syntax since v1.0, launched in Sep/2014 (almost 4 years ago, while the language is 6.5 years old).

Granted, 4 years is not a long time in terms of programming languages, especially when compared to Erlang/C++/Ruby, but it shows the trend is not pointing to the direction you have hinted.

--


José Valim
Skype: jv.ptec
Founder and Director of R&D


Re: Binary string literal syntax

zxq9-2
On Wednesday, 6 June 2018 00:59:58 JST you wrote:
> > There is a strong chance that Elixir will explore so much in terms of
> language conveniences that it will blow up the way C++, Ruby, and Perl have.
>
> Elixir has not added new syntax since v1.0, launched in Sep/2014 (almost 4
> years ago, while the language is 6.5 years old).
>
> Granted, 4 years is not a long time in terms of programming languages,
> especially when compared to Erlang/C++/Ruby, but it shows the trend is not
> pointing to the direction you have hinted.

Don't take this the wrong way, because I think what you've done is great overall: I mean v1.0 itself represents a syntactic explosion (but it couldn't have happened any other way). You were riding the wave of community language organic evolution, and some weirdness always grew out of that. It is impossible for that *not* to happen. Even the Scheme definition has weird corners!

If you could write Elixir 2.0 from scratch, regardless what user expectations you might crush, would it come out *exactly* the same way? I doubt it. You've learned a HUGE amount about language design through this process and surely you would hope to implement some of that and clean up the corners eventually.

4 years isn't long and I would be absolutely shocked if there are *no* syntactic changes to v1.x later on down the road, doubly shocked if there is never an Elixir v2, and triply shocked if a few linguistic earthquakes don't happen if or when you decide to abdicate as Elixir's benevolent dictator.

(Please never abdicate, by the way. That's a scary thought.)

This isn't an indictment of Elixir -- I don't mean it that way -- language design is HARD. People just can't believe how incredibly hard it is to design a sane language. That's why I am extremely conservative with language syntax suggestions once a language is already in broad use. There are strong, principled reasons you *haven't* added syntax to Elixir since v1.0, right? That's my point.

-Craig

Re: Binary string literal syntax

Sam Overdorf
Has anyone ever considered making a string a type and not a list?
Thanks,
Sam Overdorf
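
For context on the question, a small sketch of what "a string is a list" means in practice. The module name is made up for illustration; the calls are standard BIFs plus io_lib:printable_list/1.

```erlang
-module(string_is_list).
-export([demo/0]).

demo() ->
    S = "abc",
    %% A double-quoted literal is just a list of code points ...
    true = is_list(S),
    true = (S =:= [97, 98, 99]),
    %% ... so "is this text?" can only be guessed from the contents.
    true = io_lib:printable_list([97, 98, 99]),
    %% The binary form is a different type altogether.
    false = is_list(<<"abc">>),
    true = is_binary(<<"abc">>),
    ok.
```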


Re: Binary string literal syntax

José Valim-2
In reply to this post by zxq9-2
Right, all languages have syntax weirdness. And yes, Elixir has more syntactical constructs than Erlang, even being much younger. But there is still a very long way to become comparable to languages like C++ and Perl. And if Erlang can avoid further syntax growth through discipline, I don’t see any reason why Elixir can’t either. :)

Elixir was designed with a macro system and AST in mind, exactly so we can have a core of constructs and derive everything else from this core. That’s the reason why we haven’t added further syntax, because it was designed so we don’t have to.

So I would be surprised if Elixir v2.0 includes more than a couple new constructs, if any. The most likely chance of getting new constructs is because Erlang got them too (say we got a new data type, like when we got maps).

I am fully aware that 4 years is very little time to extract trends from and guess about the future, but I am considerably more optimistic about how this particular aspect of Elixir will grow over time.




Re: Binary string literal syntax

Sean Hinde-4
In reply to this post by zxq9-2


> On 6 Jun 2018, at 00:45, [hidden email] wrote:
>
> On Tuesday, 5 June 2018 19:51:32 JST Sean Hinde wrote:
>> I’ve been writing a lot of Elixir over the last few months (plus Swift, Java, C) and just came back to Erlang. There are a few things I’ve come to very much like about Elixir I think might be quite useful to bring to Erlang.
>>
>> The first, and topic of this email, is arguably trivial, but having to surround <<"modern">> string literals with <<>> is irritating and not getting any less so.
>
> I like ideas such as this but believe the core language is not the place to implement them.
>
> Syntax is like ocean plastic -- it lingers forever, works its way into unexpected places and interacts with the metabolism in surprising ways. Adding syntax to an existing language, especially syntax that represents something one can already represent, is *dangerous*. It makes the overall language less useful and each addition dramatically increases the cognitive overhead of learning the language.

I have happy memories of waiting eagerly each year to see what new things would be added to the language. The bit syntax was a truly stunning addition. When sometime later the ability to write literal strings inside a binary was added it was pretty convenient.

The Erlang team has not shied away from adding features where they push the language forward in its practicality for dealing with the protocol related problems of the day.

Back then it was bit twiddling, these days protocols are more string based.

>
> Typing two different characters two times each, as in <<>> does not exhibit much advantage in keystrokiness over typing a widely gapped sequence ~s''. Magical quote interpretation, as in Python, to avoid the need to escape one or the other kinds of quote literals within a string is handy, but doesn't change much about the difficulty of the language. As Erlang already makes a distinction between single and double quotes, though, diddling with this is a dangerous change (adding syntax) in the interest of glyphy familiarity.

Keystroke saving was not really a motivation, though having to add them at both ends of a string is a bunch of navigation as well as the extra keystrokes.

>
> Erlang is a small language. A large part of its utility stems from the fact that the core language is TINY and not littered with a ton of conveniences. This should be strongly protected. Compare the universal readability of C vs the wetware-level incompatibility among various styles of C++ (or Haskell, for that matter); that's astonishing and sad about C++.
>
> If Elixir does what you like, then use it. There is a strong chance that Elixir will explore so much in terms of language conveniences that it will blow up the way C++, Ruby, and Perl have. That's fine, because we'll learn a lot and Elixir's grandchildren will really be something! Introducing Elixirisms into Erlang, a stick-in-the-mud, traditional, old, venerable language with a very low learning curve at the level of language complexity, is purely a risk.

I guess there is some risk that adding a nicer way to write binary strings would break the camel’s back and make the language more difficult to learn.

I happen to believe that this change would have the opposite effect. It ought to reduce the cognitive overhead for newcomers to the language (“Why do I have to write these weird angle brackets just to get a ‘normal’ string?”), and make the choice between the two string forms less biased by syntax and more based on utility for the rest of us.

I can see your point for the full scope of Elixir sigils, but I think there is a level of addition that would simply just help. Maybe just the #"Syntax"

>
> I'm an Erlang conservationist. Don't corrupt the pristine Erlangnessness with your stays-around-forever-like-marine-plastic syntactic litter!

lol. I also love the considered pace of language additions. Maps took a pretty long time to finalise and they have turned out nice. Having written a decent amount of C++ in the last 10 years I am the last person who would want to wish that on the language :)


Re: Binary string literal syntax

Sean Hinde-4
In reply to this post by Sean Hinde-4
I had an off list mail with some questions that I’d like to share my answers to

The first question was about my use of the word <<"modern">>

Modern had a few thoughts behind it:
 - It depends on the greyness of your beard ;)
 - The great re-write of libraries dealing with external string data as input to use binaries instead of strings (especially the json libs)
 - That UTF-8 has emerged as the universal standard for string data

It means that dealing with modern Erlang libs and modern data drives the use of binary strings as the default way of writing strings in Erlang programs, where we would once have just used list-of-int strings.

>
> Anyway, I feel like Erlang is even less of a string-wrangling scripting language than Elixir and so I think I'd find sigils even less helpful.

As a protocol wrangling language I would argue Erlang has no peers, but many more protocols are string based now than when the bit syntax was invented.

>
>
>> LFE uses:
>>
>> #"Some binary string"
>
>
> That one's pretty attractive to me. I could see myself writing them that way.

One vote in that direction. Thank you !

> My question, though, is what are the trade-offs of using binaries vs list-of-ints for strings? I think the syntax ought to push people towards the "best" answer. Erlang seems to say list-of-ints and Elixir definitely says binaries.

I think if we had a nicer way to write strings as binaries it would push the needle pretty far in the direction of that as the best answer in a lot more cases.

I don’t see a lot of reasons not to use binary strings as the preferred option: they take less memory, and they can be compared, indexed, searched, etc. much more efficiently than lists as strings.
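
A rough way to see the memory side of that claim. This is a sketch under assumptions: erts_debug:flat_size/1 is an undocumented debugging helper that reports heap words, and the module and function names are made up.

```erlang
-module(string_size).
-export([list_bytes/1, binary_bytes/1]).

%% Heap bytes used by a list string. flat_size/1 reports words, and
%% every list element costs one cons cell (two words).
list_bytes(Str) when is_list(Str) ->
    erts_debug:flat_size(Str) * erlang:system_info(wordsize).

%% Payload bytes of the same text as a UTF-8 binary. The small fixed
%% per-binary header is not counted here.
binary_bytes(Str) when is_list(Str) ->
    byte_size(unicode:characters_to_binary(Str)).
```

On a 64-bit emulator each character of a list string takes 16 bytes, against one to four bytes per character in a UTF-8 binary.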

Sean

Re: Binary string literal syntax

Nathaniel Waisbrot
Thanks for copying my email to the list, Sean. That's what happens when I stay up late writing emails.


> - That UTF-8 has emerged as the universal standard for string data


I think this is an important point. Your bin-string-marker proposal would actually be equivalent, I think, to <<"some string"/utf8>> which is a little more of a mouthful and therefore a better argument in favor of it.


Re: Binary string literal syntax

Vlad Dumitrescu-2
Hi!

I have a few thoughts about this. I would favor the proposed syntax, but not if things don't get simpler. What I mean is that there's more to consider.

- Some modules don't handle binary strings, but lists of chars; most notably erl_scan. If the syntaxes are too close, it might be even more confusing when to use which form.
- The new string functions work with strings as sequences of lexemes. The "list strings" are lists of characters, so for example calling length() on the two representations of the same text may not return the same value. Most notably, CRLF is a lexeme, but two characters.
- When working with a textual protocol, it's still quite often that one would use <<"prefix"/utf8, Rest/binary>>, where the current syntax still has to be used. It might be confusing? 
- The predefined type string() is still [char()], and for binary strings there is unicode:chardata(), which is not necessarily obvious (as these are handled by the string module).
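
To make the second point concrete, a small sketch of the CRLF case (the string module documentation calls these units grapheme clusters). The module name is made up; length/1, string:length/1 and unicode:characters_to_binary/1 are the standard functions.

```erlang
-module(length_demo).
-export([demo/0]).

demo() ->
    CRLF = "\r\n",
    %% length/1 counts list elements, i.e. code points.
    2 = length(CRLF),
    %% string:length/1 counts grapheme clusters; CR LF is one cluster.
    1 = string:length(CRLF),
    %% "e" followed by U+030A (combining ring above): two code points,
    %% one grapheme cluster.
    E = [$e, 16#30A],
    2 = length(E),
    1 = string:length(E),
    %% byte_size/1 on the UTF-8 binary gives yet another number.
    3 = byte_size(unicode:characters_to_binary(E)),
    ok.
```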

best regards,
Vlad



Re: Binary string literal syntax

Sean Hinde-4


On 6 Jun 2018, at 14:00, Vlad Dumitrescu <[hidden email]> wrote:

> Hi!
>
> I have a few thoughts about this. I would favor the proposed syntax, but not if things don't get simpler. What I mean is that there's more to consider.

I was aware of having missed a few details, and aware I was undoubtedly unaware of more :)


> - Some modules don't handle binary strings, but lists of chars; most notably erl_scan. If the syntaxes are too close, it might be even more confusing when to use which form.

Very true, though equally some modules don’t handle lists of chars. erl_scan is a big one, but I guess we are all used to the endless round of list to binary and vice versa in these cases.

> - The new string functions work with strings as sequences of lexemes. The "list strings" are lists of characters, so for example calling length() on the two representations of the same text may not return the same value. Most notably, CRLF is a lexeme, but two characters.

That is a big question. How should they be represented? I was happily assuming UTF-8, but maybe it would make more sense for them to be compatible with the new string module and be stored as lexeme sequences.

Looking around, it seems there is a good range of sensible options. Elixir defaults to string literals being utf8; Swift uses Unicode scalars in its internal string representation, forcing conversion to get a byte-based representation.

With my protocol hat on I think I would pick utf8 as that is the most likely external representation and in many cases we would never need to convert and hence be efficient, but I can see arguments for this being poor design for a language.

> - When working with a textual protocol, it's still quite often that one would use <<"prefix"/utf8, Rest/binary>>, where the current syntax still has to be used. It might be confusing?

<<#"prefix", Rest/binary>> ?

Definitely room for deeper thought here.
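
For comparison, here is the prefix match as it can be written with the current syntax. This is only a sketch: the module name and the "GET "/"PUT " commands are invented for illustration, and the <<#"prefix", ...>> form above remains hypothetical.

```erlang
-module(prefix_demo).
-export([parse/1]).

%% A literal prefix in a binary pattern; only Rest is bound.
parse(<<"GET ", Rest/binary>>) -> {get, Rest};
%% The same with an explicit /utf8 on the literal, as in the quoted
%% example. For ASCII-only prefixes the two forms match the same bytes.
parse(<<"PUT "/utf8, Rest/binary>>) -> {put, Rest};
parse(_) -> {error, unknown_command}.
```

Whatever shorthand is chosen for standalone literals, it would have to read sensibly next to segments like Rest/binary.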

> - The predefined type string() is still [char()], and for binary strings there is unicode:chardata(), which is not necessarily obvious (as these are handled by the string module).

There is a type for unicode_binary() in the unicode module which refers to a utf8 binary string. The unicode.erl docs go as far as saying:

"The default Unicode encoding in Erlang is in binaries UTF-8, which is also the format in which built-in functions and libraries in OTP expect to find binary Unicode data"

There is also a strange example in the string.erl document where this binary <<"abc..åäö">> is not stored as UTF-8 but instead as latin-1. Having an unambiguous way to represent a UTF-8 string literal would also clear this up.

That seems to point in a clear direction.

Excellent input, thank you.

Sean



Re: Binary string literal syntax

Fred Hebert-2

On Wed, Jun 6, 2018 at 8:56 AM, Sean Hinde <[hidden email]> wrote:

> "The default Unicode encoding in Erlang is in binaries UTF-8, which is also the format in which built-in functions and libraries in OTP expect to find binary Unicode data"
>
> There is also a strange example in the string.erl document where this binary <<"abc..åäö">> is not stored as UTF-8 but instead as latin-1. Having an unambiguous way to represent a UTF-8 string literal would also clear this up.
>
> That seems to point in a clear direction.

 
A clarification here. In Erlang, you have to be aware of the following possible encodings:
  • "abcdef": a string, which is made of straight up unicode codepoints. This means that if you write [16#1f914] you'll quite literally get "🤔" as a string, with no regards to encoding
  • <<"abcdef">> as a binary string, which is shorthand for <<$a, $b, $c, $d, $e, $f>>. Which is an old standard list of integers transformed as a binary. By default this format does not support unicode encodings, and if you put a value that is too large in there (such as 16#1f914) by declaring a binary like <<"🤔">>, you will instead find yourself with an overflow, and the final binary <<20>>.
  • <<"abcdef"/utf8>> as a binary string that is unicode encoded as utf-8. This one would work to support emojis
  • <<"abcdef"/utf16>> as a binary string that is unicode encoded as utf-16
  • <<"abcdef"/utf32>> as a binary string that is unicode encoded as utf-32
  • ["abcdef", <<"abcdef"/utf8>>]: iodata type list that can support multiple inputs. As far as I remember, your list can be codepoints as usual, but you'll want all the binaries to be the same encoding (ideally utf-8) to prevent issues where encodings get mixed
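
The binary forms in the list above can be checked directly. A minimal sketch, with a made-up module name; it uses integer segments and \x{...} escapes so the result does not depend on the source file encoding.

```erlang
-module(encoding_demo).
-export([demo/0]).

demo() ->
    Thinking = 16#1F914,                 % the code point used in the list above
    %% Default integer segment: 8 bits, silently truncated.
    <<20>> = <<Thinking>>,
    %% Explicit encodings of the same code point.
    <<240,159,164,148>> = <<Thinking/utf8>>,
    <<216,62,221,20>> = <<Thinking/utf16>>,
    <<0,1,249,20>> = <<Thinking/utf32>>,
    %% U+00E5 (a with ring) fits in a byte, so nothing is truncated,
    %% but the plain form is Latin-1 and not UTF-8.
    <<229>> = <<"\x{E5}">>,
    <<195,165>> = <<"\x{E5}"/utf8>>,
    ok.
```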

When the standard library functions say they "expect to find utf-8 by default", it means that when you call functions such as the new ones in the string module, or those in the unicode module where parameters can be given (i.e. unicode:characters_to_binary/1-3), if nothing is specified, then utf-8 is assumed for binaries. But it does not mean that the literal binary strings you write in code are assumed to be utf-8 by default. That's confusing, but yeah.
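
A short sketch of that "utf-8 unless told otherwise" behaviour, using unicode:characters_to_binary/1,2. The module name is made up for illustration.

```erlang
-module(unicode_defaults).
-export([demo/0]).

demo() ->
    %% A list of code points in, UTF-8 binary out (the default).
    <<240,159,164,148>> = unicode:characters_to_binary([16#1F914]),
    %% A binary argument is assumed to be UTF-8 already ...
    <<195,165>> = unicode:characters_to_binary(<<195,165>>),
    %% ... so a lone Latin-1 byte such as 229 is not accepted as text:
    %% the result is an error/incomplete tuple rather than a binary.
    false = is_binary(unicode:characters_to_binary(<<229>>)),
    %% Stating the input encoding makes the conversion explicit.
    <<195,165>> = unicode:characters_to_binary(<<229>>, latin1),
    ok.
```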

Aside from that, I would say that the choices Elixir made have one risky side to them (the same is possible in Erlang but I'm calling it out because I've seen it a few times in Elixir samples and Erlang has not historically had as many examples of string handling). Because strings are utf8 binaries by default in Elixir, whenever you feel like pattern matching iteratively, you may do something like:

<<head::utf8, rest::binary>> which in Erlang would be <<Head/utf8, Rest/binary>>. The risk of doing this is that this fetches text by codepoint, whereas when doing text processing, it is often better to do it by grapheme. The best example for that is the family emoji. By default, it could be just a single codepoint, encoded on many bytes, giving: 👪

That's fine and good, but the problem comes from the fact that graphical (and logical) representation is not equal to the underlying codes creating the final character. Those exist for all kinds of possible ligatures and assemblies of "character parts" in various languages, but for Emojis, you can also make a family by combining individual people: 👩‍👩‍👦‍👦 is a family composed of 4 components with combining marks: 👩 + 👩  + 👦 + 👦, where + is a special combining mark (a zero width joiner) between two women and two boys. If you go ahead and consume that emoji using the /utf8 modifier, you'll break the family apart and change the semantic meaning of the text.

If you edit the text in a text editor that traditionally has good support for locales and all kinds of per-language rules, such as Microsoft Word (the only one I know to do a great job of automatically handling half-width spaces and non-breakable spaces when language asks for it), pressing backspace on 👩‍👩‍👦‍👦 will remove the whole family as one unit. If you do it in Firefox or Chrome, deleting that one 'character' will take you 7 backstrokes: one for each 'person' and one for each zero-width joining character. Slack will consider them to be a single character and Visual Studio Code behaves like the browsers (even if both are Electron apps), and notepad.exe or many terminal emulators will instead expand them as 4 people and implicitly drop the zero-width joining marks.
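
The same effect can be reproduced without emoji. The sketch below uses "e" plus a combining ring instead of the family, on the assumption that how emoji sequences cluster depends on the Unicode tables shipped with a given OTP release; the module name is made up.

```erlang
-module(grapheme_demo).
-export([demo/0]).

demo() ->
    %% "e" + U+030A (combining ring above), UTF-8 encoded.
    Text = <<$e/utf8, 16#30A/utf8>>,
    %% Matching with /utf8 takes one code point and strands the mark.
    <<Head/utf8, Rest/binary>> = Text,
    $e = Head,
    <<204,138>> = Rest,
    %% The string module keeps the cluster together.
    [[$e, 16#30A]] = string:to_graphemes(Text),
    1 = string:length(Text),
    ok.
```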

If you want to deal with unicode strings, you really should use the string functions from the string module (String in Elixir), and work on graphemes or codepoints depending on the context. One interesting thing there is that you can use these to return you strings re-built up as graphemes using the to_graphemes functions in either language:

1> string:to_graphemes("ß↑e̊").
[223,8593,[101,778]]
2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
[223,8593,[101,778]]

This lets you take any unicode string, and turn it into a list that is safe to iterate using calls such as lists:map/2 or list comprehensions. This can only be done through iodata(), and this might even be a better format than what you'd get with just default UTF-8 binary strings. Pattern matching is still risky there. Ideally you'd possibly want to do a round of normalization first, so that characters that can be encoded in more than one way (say â which can be a single codepoint or a+^ as two points) are forced into a uniform representation.
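
A minimal sketch of that normalization round, using unicode:characters_to_nfc_list/1 (NFC is just one of the available forms; the module name is made up).

```erlang
-module(normalize_demo).
-export([demo/0]).

demo() ->
    Composed = [16#E2],            % â as one code point
    Decomposed = [$a, 16#302],     % a + combining circumflex
    %% The two spellings are different lists ...
    false = (Composed =:= Decomposed),
    %% ... until both are normalized to the same form.
    Composed = unicode:characters_to_nfc_list(Decomposed),
    true = unicode:characters_to_nfc_list(Composed)
           =:= unicode:characters_to_nfc_list(Decomposed),
    ok.
```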

The thing that I'm worried about is how could we make the richness (and pitfalls!) of Unicode handling easier to deal with. So far I've been pleasantly surprised that having no native string type and using codepoints by default did force me to learn a bunch about Unicode and how to do it right, but it's difficult to think that this is the optimal path for everyone.

If I had to argue for something, it would be that a good "beginner" string type would be an opaque one that inherently carries its own encoding, and cannot be pattern-matched on unless you use a 'graphemed' + normalized iodata structure. If you wanted to switch to codepoints for handling, then you could convert it to a binary or to another type. But even then this would have a weakness because you would necessarily be forced to convert from say, a utf-8 byte stream coming from a socket, onto a different format: this is exactly what is annoying people today when they just want the damn strings to use "abc" because it's shorter to write.

I personally think that this is a clash between correctness and convenience. Currently Erlang is not necessarily 'correct', but it at least does not steer you entirely wrong through convenience since using utf8 (the default many people want) is cumbersome. I'd personally go for a 'correct' option (strongly typed strings that require conversions between formats based on context and usage), but I fear that this thread and most complaints about syntax worry first about convenience, and I don't know that they're easy to reconcile.


Re: Binary string literal syntax

Fred Hebert-2
On Wed, Jun 6, 2018 at 9:51 AM, Fred Hebert <[hidden email]> wrote:

> That's fine and good, but the problem comes from the fact that graphical (and logical) representation is not equal to the underlying codes creating the final character. Those exist for all kinds of possible ligatures and assemblies of "character parts" in various languages, but for Emojis, you can also make a family by combining individual people: 👩‍👩‍👦‍👦 is a family composed of 4 components with combining marks: 👩 + 👩  + 👦 + 👦, where + is a special combining mark (a zero width joiner) between two women and two boys. If you go ahead and consume that emoji using the /utf8 modifier, you'll break the family apart and change the semantic meaning of the text.
>
> If you edit the text in a text editor that traditionally has good support for locales and all kinds of per-language rules, such as Microsoft Word (the only one I know to do a great job of automatically handling half-width spaces and non-breakable spaces when language asks for it), pressing backspace on 👩‍👩‍👦‍👦 will remove the whole family as one unit. If you do it in Firefox or Chrome, deleting that one 'character' will take you 7 backstrokes: one for each 'person' and one for each zero-width joining character. Slack will consider them to be a single character and Visual Studio Code behaves like the browsers (even if both are Electron apps), and notepad.exe or many terminal emulators will instead expand them as 4 people and implicitly drop the zero-width joining marks.

Well this kind of illustrates my point. Instead of seeing the unicode family as 4 joined people in one unit (see https://emojipedia.org/family-woman-woman-boy-boy/), it appears gmail has expanded the family into 4 distinct people. So please use the emojipedia reference when reading my previous e-mail.
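For anyone who wants to poke at this from the shell, here is a minimal sketch of the three ways of counting that same family. The module and function names are made up for illustration, and it assumes OTP 20 or later for string:length/1. The grapheme cluster count is deliberately not stated, since it depends on the Unicode tables shipped with the release.

```erlang
-module(family_demo).
-export([family/0, counts/0, first_codepoint/0]).

%% The family emoji as a list of code points:
%% woman, ZWJ, woman, ZWJ, boy, ZWJ, boy.
family() ->
    Woman = 16#1F469,
    Boy = 16#1F466,
    ZWJ = 16#200D,   % zero width joiner
    [Woman, ZWJ, Woman, ZWJ, Boy, ZWJ, Boy].

%% Three ways of counting the same text.
counts() ->
    Bin = unicode:characters_to_binary(family()),
    #{bytes => byte_size(Bin),
      codepoints => length(unicode:characters_to_list(Bin)),
      %% string:length/1 counts grapheme clusters; the exact value
      %% depends on the Unicode tables of the OTP release in use.
      clusters => string:length(Bin)}.

%% Matching with /utf8 peels off one code point, not the whole family.
first_codepoint() ->
    <<First/utf8, _Rest/binary>> = unicode:characters_to_binary(family()),
    First.
```

The UTF-8 binary is 25 bytes and 7 code points, and a /utf8 match returns only the first woman (16#1F469), which is the "breaking the family apart" case.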



Re: Binary string literal syntax

Loïc Hoguin-3
In reply to this post by Sean Hinde-4
Alternatively, allow making use of Unicode and support the characters «
and » alongside << and >>. This is already what I use when writing code
(using vim's conceal feature) and it makes the source a lot prettier to
read.

Same goes for the various arrows.

I think another language already does this, was it OCaml?

On 06/05/2018 07:51 PM, Sean Hinde wrote:

> I’ve been writing a lot of Elixir over the last few months (plus Swift, Java, C) and just came back to Erlang. There are a few things I’ve come to very much like about Elixir I think might be quite useful to bring to Erlang.
>
> The first, and topic of this email, is arguably trivial, but having to surround <<“modern”>> string literals with <<>> is irritating and not getting any less so.
>
> It occurred to me we might usefully steal the sigil notation from Elixir to solve this and add other string literal related syntactic niceties.
>
> My proposal would be to add an alternative notation for binary string literals in Erlang along the lines of:
>
> ~s”Some binary string” mapping to <<"Some binary string”>>
>
> This would open the door to other alternative notations. e.g. Elixir has one for regex friendly strings:
>
> ~r/foo|bar/
>
> String literals containing quotes could also benefit from different delimiter characters e.g.
>
> ~s'{"Name":"Tom","Age":10}’ mapping to <<"{\"Name\":\"Tom\",\"Age\":10}”>>
>
> The Elixir docs cover more options: https://elixir-lang.org/getting-started/sigils.html. There are 8 alternative delimiters, and even user defined sigils and interpolation with #{}
>
> If we didn’t want to go as far as Elixir style sigils we could just use a single prefix char.
>
> LFE uses:
>
> #”Some binary string”
>
> or maybe
>
> ~”Some binary string”
>
> If people here on the list think there is anything useful in this, and we can reach some kind of rough consensus on how far it makes sense to go I will put some effort into an EEP.
>
> Sean
>
>   - first post since July 2008. It’s good to be back

--
Loïc Hoguin
https://ninenines.eu

Re: Binary string literal syntax

Sean Hinde-4
In reply to this post by Fred Hebert-2
---
snip - many good and interesting reasons angels fear to tread here
---

>
> If I had to argue for something, it would be that a good "beginner" string type would be an opaque one that inherently carries its own encoding, and cannot be pattern-matched on unless you use a 'graphemed' + normalized iodata structure. If you wanted to switch to codepoints for handling, then you could convert it to a binary or to another type. But even then this would have a weakness because you would necessarily be forced to convert from say, a utf-8 byte stream coming from a socket, onto a different format: this is exactly what is annoying people today when they just want the damn strings to use "abc" because it's shorter to write.

I would fear an even louder chorus if we created a “beginner” string type that was not useable in most contexts a beginner might want to use a string!

Apple have always gone deep on this topic and the solution in Swift is quite ok, at the cost of having to explicitly export to utf-8 to send over the network / store in a file.

>
> I personally think that this is a clash between correctness and convenience. Currently Erlang is not necessarily 'correct', but it at least does not steer you entirely wrong through convenience since using utf8 (the default many people want) is cumbersome. I'd personally go for a 'correct' option (strongly typed strings that require conversions between formats based on context and usage), but I fear that this thread and most complaints about syntax worry first about convenience, and I don't know that they're easy to reconcile.

As an engineering tool for dealing with protocols I would describe Erlang as having tended towards the pragmatic, which could be described as a fine line between correct and convenient.

A new notation as a shorthand for utf8 string literals combined with the power of full binary string encoding and lists as code points doesn’t seem like it would be too misleading.

Of course in this imaginary language extension we can write any kind of program we like:

~u8"utf-8 string"
~u16"utf-16 string"
~u"unicode string"

Though I am with zxq9 that any changes really ought not to make the language worse or less understandable.
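For comparison, a rough sketch of what those three imaginary sigils might correspond to in today's syntax. The mapping is my assumption, not part of any proposal, and the module name is made up. The literal is ぷ (U+3077) written as an escape so the example does not depend on source file encoding.

```erlang
-module(literal_demo).
-export([utf8/0, utf16/0, codepoints/0]).

%% ~u8"..." would presumably correspond to today's /utf8 literal.
utf8() -> <<"\x{3077}"/utf8>>.

%% ~u16"..." to /utf16 (big-endian unless an endianness is given).
utf16() -> <<"\x{3077}"/utf16>>.

%% ~u"..." to a plain list of code points.
codepoints() -> "\x{3077}".
```

The same character comes out as three bytes, two bytes, and a one-element list respectively, so the existing bit syntax already covers the encodings. What it lacks is brevity.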

Rust seems to have got in a mess here:
https://github.com/rust-lang/rust/blob/master/src/grammar/raw-string-literal-ambiguity.md

Go picked utf8 for literal strings without too much complaint.

The slightly wicked side of me would find great enjoyment in a Hacker News post proclaiming Erlang as the one true language for string processing :)

Sean



Re: Binary string literal syntax

zxq9-2
In reply to this post by Sean Hinde-4
On Wednesday, 6 June 2018 11:41:01 JST Sean Hinde wrote:

> As a protocol wrangling language I would argue Erlang has no peers, but many more protocols are string based now than when the bit syntax was invented.

By count this is patently false. Most protocols are binary based, as the ad hoc binary protocols created for IoT vastly outnumber the handful of prolific string-based ones. Can you think of a better language for IoT protocol wrangling than Erlang?

Sure, most people have no clue how to program sockets these days so they use HTTP for everything -- but that isn't *most* protocols, that's a relatively small set of overwhelmingly *prolific* protocols. My prediction is that binary protocols will become more prolific as the extremely limited shared resource of wireless bandwidth becomes more and more saturated (and I don't think compression is a fix-all here, though it certainly helps).

It is very hard to tell whether Erlangers as a whole encounter binary or string-based protocols more *often* because a survey on that sort of thing is very hard to pull off -- but it is pretty clear that the sexy, high-profile cases tend to involve the web and that is indeed mostly string-based stuff.

There is also the issue with file-format deconstruction (which I do a fair amount of) and a LOT of it is binary.

Better handling of UTF-8 (or unicode, more generally, as remember Windows is natively UTF-16...) would be nice as a single case to latch on to and really focus on supporting from every angle -- but it is VERY FAR from being The One Grand Unifying Case.

I've commented on quite a few threads about encodings, strings, why graphemes and lexemes matter, and the myopia that comes with dealing with mostly European originated languages. I live in Japan and deal with Shift-JIS, JIS, JIS7, ISO-2022, EUC, etc. variants all the time. Is that a common case? Not in the West, but it is totally normal here -- especially dealing with web data.

* Binary protocols are alive and well
* The old encodings are far from dead.
* You have a good point about improvements being possible and desirable.
* The best way to proceed is not clear.
* The unicode-correctish improvements to the string and unicode modules are very encouraging.

-Craig

Re: Binary string literal syntax

zxq9-2
In reply to this post by Vlad Dumitrescu-2
On Wednesday, 6 June 2018 14:00:20 JST Vlad Dumitrescu wrote:

> - The new string functions work with strings as sequences of lexemes. The
> "list strings" are lists of characters, so for example calling length() on
> the two representations of the same text may not return the same value.
> Most notably, CRLF is a lexeme, but two characters.

To expand on this point (as I've done before) many lexemes used in CJK have multiple constructions that are considered equivalent. Korean hangul is almost a pure example of this, as input is typically done over aggregate lexemes and often re-masked as a single codepoint once the input phase is complete, but not always. Pinyin input works in a similar way but has way more complex aggregate lexemes, though the principle is similar. Even Japanese has a few examples of this (is ぷ a single character or is it [[ふ,゜]] this particular time we encounter it?).

etc.

Saying "unicode is the standard now, and UTF-8 is The One True Way" is also saying "a fantastically complex world of codepoints and construction indicators that allow for multiple representations as equivalent is now the standard". That doesn't do anything to solve the question of whether there should be a separate string type. It also ignores that Windows is natively UTF-16, not UTF-8 (though it works a lot better with UTF-8 these days).
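A small sketch of both points, the CRLF case from the quoted text and the ぷ case, using only the string module from OTP 20 or later. The module name is invented, and I am assuming the combining handakuten (U+309A) is what is meant by the second form of ぷ.

```erlang
-module(lexeme_demo).
-export([crlf/0, pu/0]).

%% CRLF: length/1 counts list elements (code points),
%% string:length/1 counts grapheme clusters.
crlf() ->
    {length("\r\n"), string:length("\r\n")}.

%% ぷ written two ways: precomposed U+3077, or ふ (U+3075) followed by
%% the combining handakuten U+309A.
pu() ->
    Composed = [16#3077],
    Decomposed = [16#3075, 16#309A],
    #{raw_equal => Composed =:= Decomposed,
      %% string:equal/4 takes IgnoreCase and a normalization form.
      nfc_equal => string:equal(Composed, Decomposed, false, nfc),
      clusters => {string:length(Composed), string:length(Decomposed)}}.
```

crlf() gives {2, 1}. For ぷ the two lists are not equal as terms, they are equal once compared under NFC, and both count as a single grapheme cluster. Which of those answers is "right" depends on the program, which is rather the point.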

Go read the unicode standard. It's... well, just have fun reading it. I don't know anyone who understands *all* of it 100% -- because for most people the first 10% or so "works for me" (for mainstream CJK use probably the first 30% or so).

I suppose all of this is to say that when you compare the enormous number of corner cases in string handling against MERE SYNTAX the syntax is such a trivial issue that it isn't even worth time thinking about. The reason this subject comes up once a year is because we all wish there were a magical set of characters we would write into source files that mean "just make this string work, regardless what it is, because this is hard". We want a syntax that represents the system abstracting away from the underlying data and instead forcing it to mean what *we* mean -- which is insufficiently low-level for much of the work we do as programmers (so we're leaving a lot undefined there, which is bad). At what level do you need to interpret a string? The answer is different for different programs (someone writing a Korean input interpreter is in a very different case than someone writing a chat server). This desire is a kissing cousin of the grand desire for a unified method of l10n and i18n -- which turns out to be just as hard to get right as string handling because every case is different, often in a way that conflicts with another case you have to handle.

Blah blah. Strings are hard, and about half of what we think of as "strings" are really serializations of non-string data anyway.

-Craig

Re: Binary string literal syntax

Sean Hinde-4
In reply to this post by zxq9-2


> On 7 Jun 2018, at 00:21, [hidden email] wrote:
>
> On Wednesday, 6 June 2018 11:41:01 JST Sean Hinde wrote:
>
>> As a protocol wrangling language I would argue Erlang has no peers, but many more protocols are string based now than when the bit syntax was invented.
>
> By count this is patently false. Most protocols are binary based, as the ad hoc binary protocols created for IoT vastly outnumber the handful of prolific string-based ones. Can you think of a better language for IoT protocol wrangling than Erlang?

No arguments from me on the suitability of Erlang for protocol wrangling. And these string based ones are definitely prolific. I spent today dealing with JSON in Erlang for some banking protocol.

>
> Sure, most people have no clue how to program sockets these days so they use HTTP for everything -- but that isn't *most* protocols, that's a relatively small set of overwhelmingly *prolific* protocols. My prediction is that binary protocols will become more prolific as the extremely limited shared resource of wireless bandwidth becomes more and more saturated (and I don't think compression is a fix-all here, though it certainly helps).

I don’t think it really matters how we count. Text based protocols are here and Erlang ought to provide a great programming environment for them too.

> Better handling of UTF-8 (or unicode, more generally, as remember Windows is natively UTF-16...) would be nice as a single case to latch on to and really focus on supporting from every angle -- but it is VERY FAR from being The One Grand Unifying Case.
>
> I've commented on quite a few threads about encodings, strings, why graphemes and lexemes matter, and the myopia that comes with dealing with mostly European originated languages. I live in Japan and deal with Shift-JIS, JIS, JIS7, ISO-2022, EUC, etc. variants all the time. Is that a common case? Not in the West, but it is totally normal here -- especially dealing with web data.
>
> * Binary protocols are alive and well
> * The old encodings are far from dead.
> * You have a good point about improvements being possible and desirable.
> * The best way to proceed is not clear.
> * The unicode-correctish improvements to the string and unicode modules are very encouraging.

Nice summary. You have obviously thought about this a lot. Any thoughts on a better solution? What would you do?

Maybe a hypothetical new string literal type treated as unicode internally but with transparent conversion to utf-8 by default when sent to io (with the option to override)? I get Japan, but utf8 is a sane default.

Or maybe some new slick syntax to create a string literal in any encoding.

The bit syntax was designed for picking apart bit twiddling telecom protocols. It was clearly not designed with the primary goal of representing alternative forms of string literals. It’s just not what you would choose for that application.

Sean


>
> -Craig

Re: Binary string literal syntax

zxq9-2
On Thursday, 7 June 2018 00:56:29 JST you wrote:

>
> > On 7 Jun 2018, at 00:21, [hidden email] wrote:
> >
> > On Wednesday, 6 June 2018 11:41:01 JST Sean Hinde wrote:
> >
> >> As a protocol wrangling language I would argue Erlang has no peers, but many more protocols are string based now than when the bit syntax was invented.
> >
> > By count this is patently false. Most protocols are binary based, as the ad hoc binary protocols created for IoT vastly outnumber the handful of prolific string-based ones. Can you think of a better language for IoT protocol wrangling than Erlang?
>
> No arguments from me on the suitability of Erlang for protocol wrangling. And these string based ones are definitely prolific. I spent today dealing with JSON in Erlang for some banking protocol.

...

> > * Binary protocols are alive and well
> > * The old encodings are far from dead.
> > * You have a good point about improvements being possible and desirable.
> > * The best way to proceed is not clear.
> > * The unicode-correctish improvements to the string and unicode modules are very encouraging.
>
> Nice summary. You have obviously thought about this a lot. Any thoughts on a better solution? What would you do?
>
> Maybe a hypothetical new string literal type treated as unicode internally but with transparent conversion to utf-8 by default when sent to io (with the option to override)? I get Japan, but utf8 is a sane default.
>
> Or maybe some new slick syntax to create a string literal in any encoding.
>
> The bit syntax was designed for picking apart bit twiddling telecom protocols. It was clearly not designed with the primary goal of representing alternative forms of string literals. It’s just not what you would choose for that application.

The main problem I see with this particular example is that you feel you were dealing with a "string-based protocol" because you were dealing with JSON.

You weren't -- JSON is a list of trees. It is serialized as a string, and strings are used to represent things in JSON that JSON itself is *dramatically* unsuited for, so "everything is a string" seems reasonable to people who don't know anything about type systems or were hustled into pushing a "lipstick on a chicken" prototype into production.

That last case is so common that a lot of new coders haven't ever seen anything *but* JSON in practice. That doesn't mean we should optimize for wrongness.

The point of exacerbation is that you are using a JSON serializer that outputs lists of trees of pairs that contain binary snippets instead of lists as the string representations (Jiffy, I imagine). That isn't the best way to deal with strings in Erlang, imo.

So we have a conflation of issues here:
- Strings (or more broadly, io_data()) in Erlang can *actually* represent Unicode types because they can represent things as lexemes not just a flat array of codepoints. That's actually quite advanced.
- Binaries are just that: binaries. They were indeed never intended for advanced string processing.
- Binaries *can* represent strings, are more compact in memory and are easier to deal with in NIFs, which is why Jiffy uses them.
- Jiffy is the most common JSON serializer for Erlang.

Not a single one of these issues is addressed or made easier to deal with by a new syntax that equates to <<"foo"/utf8>>. In fact, the /utf8 binary identifier has only been brought up a few times in this thread because it isn't the point.
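For readers who have not met the /utf8 tag, a short sketch of what it changes. The module name is made up, and é is written as the escape \x{E9} so the result does not depend on how the source file is encoded.

```erlang
-module(utf8_tag_demo).
-export([plain/0, tagged/0]).

%% Without a type each character is stored as a single byte
%% (é is U+00E9, so this happens to be its Latin-1 encoding).
plain() -> <<"\x{E9}">>.

%% With /utf8 each character is encoded as UTF-8 instead.
tagged() -> <<"\x{E9}"/utf8>>.
```

plain() is the one byte <<233>> and tagged() is the two bytes <<195,169>>. For pure ASCII text such as "foo" the two forms are identical, which is why the tag is so easy to forget.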

What you *really* want, I think, is this:
1. A concrete decision about how Erlang represents UTF-8 in memory. A canonicalization.
2. A single io_data() -> utf8_string() IMPORT function.
3. Access to the canonical representation so that dealing with it in Rust/C NIFs and Erlang is not mind bending.
4. A single utf8_string() -> io_data() EXPORT function that has a default serialization rule.
5. A set of functions that allow me to pick which binary representation is output if the default is unsuitable (like when I really need to cast hangul characters to their equivalent broken-down lexemes, for example).
6. A special syntax that abstracts the concept of the underlying representation for utf8 in memory.

None of these are trivial issues or should be messed about with lightly.
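To make items 2, 4 and 5 a little more concrete, here is a rough sketch built on the existing unicode module. Everything about it is hypothetical: the module name ustr, the function names, and above all the choice of NFC-normalised UTF-8 as the canonical form, which is exactly the decision item 1 says has not been made. Binaries passed in are assumed to already be UTF-8.

```erlang
-module(ustr).
-export([import/1, export/1, export/2, decompose/1]).

%% Hypothetical IMPORT: any chardata in, one canonical form out.
%% NFC-normalised UTF-8 is an assumption made for this sketch only.
import(Chardata) ->
    case unicode:characters_to_nfc_binary(Chardata) of
        Bin when is_binary(Bin) -> {ok, Bin};
        {error, _Good, _Bad} = Error -> Error
    end.

%% Hypothetical EXPORT with a default rule: UTF-8 exactly as stored.
export(Ustr) when is_binary(Ustr) ->
    Ustr.

%% EXPORT to another encoding, e.g. latin1 or {utf16, little}.
export(Ustr, Encoding) when is_binary(Ustr) ->
    unicode:characters_to_binary(Ustr, utf8, Encoding).

%% Item 5: pick a different representation, here the fully decomposed
%% form, in which hangul syllables become their individual jamo.
decompose(Ustr) when is_binary(Ustr) ->
    unicode:characters_to_nfd_binary(Ustr).
```

With this, importing ふ followed by a combining handakuten yields the same binary as importing ぷ directly, and decompose/1 turns 한 (U+D55C) into its three jamo. It says nothing about item 3, the NIF side, or about which normalization form ought to win.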

As for syntax, the quoting we have so far for the types we have so far is great. The <<"blahblah">> thing for direct access to binaries is great. The "foo" == [$f, $o, $o] sugar is also brilliant. The fact that io_data() is a nested list of stuff can very often make complex, large manipulation of io_data() way faster in Erlang than other languages that have to traverse binary strings to do their work, even if it looks ugly (but again, remembering that the *data* you're dealing with is trees merely represented by strings is key).
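A tiny sketch of the nested-list point, with a made-up module name. Output is assembled by nesting existing pieces, and nothing is copied or flattened until something actually needs a flat binary.

```erlang
-module(iodata_demo).
-export([greeting/1]).

%% Build output by nesting; Name may be a list, a binary or more iodata.
greeting(Name) ->
    [<<"Hello, ">>, Name, [$!, "\n"]].
```

iolist_to_binary(iodata_demo:greeting("Joe")) gives <<"Hello, Joe!\n">>, and the same term can be handed to a socket or file as it is, with no flattening step in user code.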

So I think Erlang has really gotten all of that right.

But we still SHOULD eventually have a canonical utf8 type.

As for syntax...

I HATE prefix-glyph syntax for quotes. Ugh. Better to just give me a single-letter function name and let me do u("blah") or whatever. Then I don't have to learn anything new, at least, and can use it in a list function or whatever.
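Something along these lines, as a sketch only. u/1 is a hypothetical helper and the module name is made up. It simply wraps unicode:characters_to_binary/1, so it returns a UTF-8 binary without normalizing anything.

```erlang
-module(u_demo).
-export([u/1, all/1]).

%% Hypothetical helper: chardata in, UTF-8 binary out.
u(Chardata) ->
    unicode:characters_to_binary(Chardata).

%% Being an ordinary function, it can be passed to list functions.
all(Strings) ->
    lists:map(fun u/1, Strings).
```

The obvious cost compared with a literal syntax is that the conversion happens at run time rather than compile time, unless the compiler learns to fold it.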

I DOUBLY HATE it when new programmers get confused by prefix-glyph syntax. You don't have to teach anyone what a normal-looking quote mark is or how to use or type them.

So if we have to have a special syntax, instead, I would recommend backticks-as-quotes.

'an_atom'
"a listy string"
<<"a binary string">>
`a canonical utf8 string`

We have a million other kinds of quotes in Japanese that would 「suit」『me』【just】《fine》 but totally screw everyone else over, sort of like German quote angle thingies would were they to be made mandatory -- but I think backticks are universally available without any special input modes (correct me if I'm wrong).

The `utf8 string` version would be a strict, canonical equivalent to <<"utf8 string"/utf8>> in memory. I'm actually not sure whether the current binary /utf8 tag forces canonicalization (or if it does, *which* unicode form is canonical in Erlang right now). The canonical representation in memory issue has to be ironed out if you want your JSON situation to improve -- and for you I think this is really the rub (whereas I have very different concerns with unicode strings, and would be a bit annoyed if an optimization in the interest of JSON made dealing with things like client-side input or string forms commonly embedded in binary protocol traffic out on my half of the planet unduly complicated).
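On the canonicalization question, a quick check one can run. As far as I can tell the /utf8 segment type only encodes the code points it is given and does not normalize them. The sketch below (made-up module name, OTP 20 or later for the normalization call) builds é both ways.

```erlang
-module(canon_demo).
-export([composed/0, decomposed/0, same_after_nfc/0]).

%% é as a single precomposed code point.
composed() -> <<16#E9/utf8>>.

%% e followed by the combining acute accent U+0301.
decomposed() -> <<$e/utf8, 16#301/utf8>>.

%% The two only become identical after explicit normalization.
same_after_nfc() ->
    unicode:characters_to_nfc_binary(composed()) =:=
        unicode:characters_to_nfc_binary(decomposed()).
```

composed() is <<195,169>> and decomposed() is <<101,204,129>>, so the two binaries differ byte for byte, and same_after_nfc() returns true. A canonical backtick string would therefore need a normalization step that the current tag does not perform.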

As far as what is happening in Erlang right now to clear some of these issues up, since R19 a LOT of unicode changes have been happening, and most of them are really headed in happy directions. I would say that we need to keep this in the back of our minds, but that implementing anything like unicode canonicalization (to the point that we are happy with whatever is decided forever and ever come-come-what-may-and-screw-the-corner-cases, amen) and especially implementing any special syntax to abstract it in code is premature.

Dan Gudmundsson has done a TON of excellent work in this area and continues to do so. He has gained a huge amount of knowledge and experience about unicode and how it interacts with current representations, and he would really be the one to ask about what "should be done" and where we are in terms of reaching a unicode string type that makes sense to deal with internally, in NIFs, exported data, etc.

What am I missing?

-Craig

Re: Binary string literal syntax

Loïc Hoguin-3
In reply to this post by Sean Hinde-4
On 06/07/2018 12:56 AM, Sean Hinde wrote:

>
>
>> On 7 Jun 2018, at 00:21, [hidden email] wrote:
>>
>> On Wednesday, 6 June 2018 11:41:01 JST Sean Hinde wrote:
>>
>>> As a protocol wrangling language I would argue Erlang has no peers, but many more protocols are string based now than when the bit syntax was invented.
>>
>> By count this is patently false. Most protocols are binary based, as the ad hoc binary protocols created for IoT vastly outnumber the handful of prolific string-based ones. Can you think of a better language for IoT protocol wrangling than Erlang?
>
> No arguments from me on the suitability of Erlang for protocol wrangling. And these string based ones are definitely prolific. I spent today dealing with JSON in Erlang for some banking protocol.
>
>>
>> Sure, most people have no clue how to program sockets these days so they use HTTP for everything -- but that isn't *most* protocols, that's a relatively small set of overwhelmingly *prolific* protocols. My prediction is that binary protocols will become more prolific as the extremely limited shared resource of wireless bandwidth becomes more and more saturated (and I don't think compression is a fix-all here, though it certainly helps).
>
> I don’t think it really matters how we count. Text based protocols are here and Erlang ought to provide a great programming environment for them too.

But they're on the way out. You won't find many new text-based
protocols, and for good reasons. Even HTTP/2 went binary (and QUIC/HTTP
will do the same).

Plain-text is still king for content, but the trend has been toward
binaries in recent years. Look at the number of binary serialization
formats that popped up. Of course, it will be harder to take over JSON.
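As an illustration of why the bit syntax suits this kind of protocol, here is a minimal sketch of splitting an HTTP/2 frame off a buffer. The field layout is the frame header from RFC 7540 section 4.1. The module name and return shape are made up, and this is nowhere near a full parser.

```erlang
-module(h2_frame_demo).
-export([parse/1]).

%% HTTP/2 frame header: 24-bit length, 8-bit type, 8-bit flags,
%% 1 reserved bit, 31-bit stream identifier, then Length payload bytes.
parse(<<Length:24, Type:8, Flags:8, _Reserved:1, StreamId:31,
        Payload:Length/binary, Rest/binary>>) ->
    {ok, #{type => Type, flags => Flags, stream_id => StreamId,
           payload => Payload}, Rest};
parse(Bin) when is_binary(Bin) ->
    more.   % not enough bytes for a complete frame yet
```

One function clause does the length-prefixed framing, the bit-level fields and the "need more data" case.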

Cheers,

--
Loïc Hoguin
https://ninenines.eu