Strings - deprecated functions


Lloyd R. Prentice-2

Dear Gods of Erlang,

"This module has been reworked in Erlang/OTP 20 to handle unicode:chardata() and operate on grapheme clusters. The old functions that only work on Latin-1 lists as input are still available but should not be used. They will be deprecated in Erlang/OTP 21."

I'm sorry. I've brought up this issue before and got lots of push back.

But every time I look up tried and true and long-used string functions, only to find that they are deprecated and will be dropped in future Erlang releases, my blood pressure soars. Both my wife and my doctor tell me that at my age this is a dangerous thing.

I do understand the importance and necessity of Unicode. And I applaud the addition of Unicode functions.

But the deprecated string functions have a long history. The English language and Latin-1 characters are widely used around the world.

Yes, it should be easy for programmers to translate code from one user language to another. But I'm not convinced that the Gods of Erlang have found the optimal solution by dropping all Latin-1 string functions.

My particular application is directed toward English speakers. So, until further notice, I have no use for Unicode.

I don't want to sound like a nationalist pig, but I think dropping the Latin-1 string functions from future Erlang releases is a BIG mistake.

I look up tokens/2, a function that I use fairly frequently, and I see that it's deprecated. I look up the suggested replacement and I see lexemes/2.

So I ask, what the ... is a lexeme? I look it up in Merriam-Webster and I see that a lexeme is "a meaningful linguistic unit."

Meaning what? I just want to turn "this and that" into "This And That."

I read further in the Erlang docs and I see "grapheme cluster." WHAT THE ... IS A GRAPHEME CLUSTER?

I look up "grapheme" in Merriam-Webster. Oh it is now all so clear: "a unit of a writing system."

Ah yes, grapheme is defined in the docs. But I have to read and re-read the definition to understand what the Gods of Erlang mean by a "grapheme cluster." And I'm still not sure I get it.

It sounds like someone took a linguistics class and is trying to show off.

But now I've spent 30 minutes, time that I don't have to waste, trying to figure out how to do a simple manipulation of "this and that." Recurse the next time I want to look up a string function in the Erlang docs.

SOLUTION

Keep the Latin-1 string functions. Put them in a separate library if necessary. Or put the new Unicode functions in a separate library. But don't arbitrarily drop them.

Some folks have suggested that I maintain my own library of the deprecated Latin-1 functions. But why should I have to do that? How does that help other folks with the same issue?

Bottom line: please please please do not drop the existing Latin-1 string functions.

Please don't.

Best wishes,

LRP

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Strings - deprecated functions

Grzegorz Junka

Dear Lloyd,

Isn't this more about documentation than the code? What I am reading is that you want to keep the old functions because you don't understand how the new functions work. Shouldn't you rather ask for clearer documentation? Is there anything in the old functions that is not supported in the new functions?

GrzegorzJ



Re: Strings - deprecated functions

Jesper Louis Andersen-2
In this case, the words come partly from terms you would find in linguistics and partly from words which have a specific meaning in the Unicode standard.

The problem with Latin-1 (ISO 8859-1) and ISO 8859-15 is that they work somewhat well for Western Latin languages but fall short on almost everything else. If your only concern is truly English text, then there should be no worry at all, since that uses ASCII, and the predominant Unicode encoding, UTF-8, was chosen such that its first 128 characters map one-to-one onto ASCII.

However, Unicode imposes some difficulties. The most notable one is that you have several ways of writing symbols such as the Danish Ø and Å: either as a specific character or as a combination, for instance an A and a small ring on top.

In languages the written symbols are graphemes, and collections of symbols forming tokens or words are lexemes. However, because one grapheme can be represented as one or several characters, the notion of a grapheme cluster arises: several code-points which form a single grapheme. It is of utmost importance for certain Asian writing systems in which a single grapheme is composed out of several smaller ones.

For ASCII, however, string:lexemes/2 would work exactly like string:tokens/2. Yet it will handle far more cases.
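
A minimal sketch of the "this and that" case raised in the original post, assuming Erlang/OTP 20 or later. The module name `title_demo` and its function names are made up for illustration; only `string:lexemes/2`, `string:titlecase/1`, `string:tokens/2`, `lists:join/2` and `lists:flatten/1` come from OTP.

```erlang
-module(title_demo).
-export([title_case/1, same_split/1]).

%% Split on spaces, titlecase each word, and join the words again.
%% string:titlecase/1 only changes the first character of its input.
title_case(Str) ->
    Words = string:lexemes(Str, " "),
    lists:flatten(lists:join(" ", [string:titlecase(W) || W <- Words])).

%% For plain ASCII input with a space separator, the old and the new
%% function are expected to return the same list of words.
same_split(Str) ->
    string:tokens(Str, " ") =:= string:lexemes(Str, " ").
```

Calling `title_demo:title_case("this and that")` should give "This And That". Note that runs of several spaces are collapsed to a single space by this sketch, because lexemes/2 skips empty lexemes.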

Unicode presents its own set of complexities. There are several ways of writing a Unicode string which is "the same" string in that it renders equally to the human eye. Hence, there are routines for handling normalization, canonicalization and collation, none of which are easy to handle.

And finally, it would probably be good to define those terms in the documentation. I don't think they are well-known to most people.  


Re: Strings - deprecated functions

Loïc Hoguin-3
In reply to this post by Lloyd R. Prentice-2
Calm down. Considering how ubiquitous the string module is, these
functions are not going to be removed for at least a few years. That
gives you plenty of time to understand the new string module.

Perhaps during this journey you can help make the documentation for the
module more user friendly by sending patches or opening tickets at
bugs.erlang.org. I'll admit that the current documentation does confuse
me personally, though I've not needed to use it yet.

Unfortunately languages are complex and Unicode is therefore also
complex. There's no real way around that. Even if you target English
speakers it's likely that you will need Unicode, because many things
require it like names or addresses for example. So even if it feels like
you won't need it (and maybe you won't) it's a good idea to be ready for it.

I wouldn't say Latin-1 is widely used anymore. Nearly everything has
switched to Unicode by now, and Erlang is one of the last. Even your
email was sent encoded in UTF-8.

--
Loïc Hoguin
https://ninenines.eu

Re: Strings - deprecated functions

PAILLEAU Eric
In reply to this post by Lloyd R. Prentice-2


Attachment: DIa_M7LUQAEhYnC.jpg (226K)

Re: Strings - deprecated functions

Onorio Catenacci
In reply to this post by Lloyd R. Prentice-2
I have had several colleagues apply logic to concurrency and the actor model analogous to what you've applied to Latin-1. That is, colleagues tell me that they will never need concurrency and their apps will never be run on multiple machines, so why pay the "mental tax" of using Erlang?

I'm not saying you're wrong but maybe you want to keep your options open.



Re: Strings - deprecated functions

Dan Gudmundsson-3
In reply to this post by Grzegorz Junka


On Wed, Nov 22, 2017 at 8:59 PM Grzegorz Junka <[hidden email]> wrote:

> Dear Lloyd,
>
> Isn't this more about documentation than the code? What I am reading is that you want to keep the old functions because you don't understand how the new functions work. Shouldn't you rather ask for clearer documentation? Is there anything in the old functions that is not supported in the new functions?
>
> GrzegorzJ
>
> On 22/11/2017 19:43, [hidden email] wrote:
>
>> Dear Gods of Erlang,
>>
>> "This module has been reworked in Erlang/OTP 20 to handle unicode:chardata() and operate on grapheme clusters. The old functions that only work on Latin-1 lists as input are still available but should not be used. They will be deprecated in Erlang/OTP 21."

The new functions also work on binaries (or even unicode:chardata(), which is basically an iolist with Unicode support),
which in many cases are a better representation of strings.

That means you can append two long strings with ["long string 1..", "long string 2.."] 
and the string will be "flattened" when output to a file or socket.
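
A small sketch of that idea, assuming Erlang/OTP 20 or later. The module name `chardata_demo` is hypothetical; `string:length/1` and `unicode:characters_to_binary/1` are the OTP calls being illustrated.

```erlang
-module(chardata_demo).
-export([demo/0]).

demo() ->
    %% "Appending" just wraps the two parts in a new list; neither part
    %% is copied. One part is a list here, the other a binary.
    Deep = ["long string 1..", <<"long string 2..">>],
    %% The new string functions accept the nested value directly.
    Len = string:length(Deep),
    %% Flatten explicitly only when a flat UTF-8 binary is really needed.
    Flat = unicode:characters_to_binary(Deep),
    {Len, Flat}.
```

Functions such as `io:put_chars/1` also accept the nested form as it is, so the explicit flattening step is often unnecessary.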

So you can see the string module as an introduction to how you should handle strings in Erlang efficiently. :-)
Though, that said, many of the new functions are slower than the functionality they are replacing;
optimizations for the ASCII/Latin-1 case are being worked on.

I changed the API rather drastically, I know that.
The reason was that in the old 'C' inspired API, you searched first and got back an index,
then split the string on that index.

Since handling Unicode, grapheme clusters, binaries and deep lists of "characters" increases the cost of traversing
the input string, I went with a new API which combines the two calls into one, e.g. find() or take(), which search the string
and return the result directly, to avoid the extra traversal.
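
To make the difference concrete, here is a hedged sketch that contrasts the two styles on a "key=value" string, assuming Erlang/OTP 20 or later. The module `search_demo` and its function names are invented for this example; `string:str/2`, `string:substr/2`, `string:split/2`, `string:find/2` and `string:take/2` are existing OTP functions.

```erlang
-module(search_demo).
-export([old_value/1, new_value/1, from_separator/1, leading_digits/1]).

%% Old style: search for an index first, then cut the string at it.
old_value(Str) ->
    case string:str(Str, "=") of
        0 -> nomatch;
        I -> string:substr(Str, I + 1)
    end.

%% New style: a single call returns the parts directly.
new_value(Str) ->
    case string:split(Str, "=") of
        [_Key, Value] -> Value;
        [_NoSeparator] -> nomatch
    end.

%% string:find/2 returns the rest of the string starting at the match,
%% or the atom nomatch.
from_separator(Str) ->
    string:find(Str, "=").

%% string:take/2 returns the longest prefix made of the given
%% characters together with the remainder.
leading_digits(Str) ->
    string:take(Str, "0123456789").
```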

>> I'm sorry. I've brought up this issue before and got lots of push back.
>>
>> But every time I look up tried and true and long-used string functions, only to find that they are deprecated and will be dropped in future Erlang releases, my blood pressure soars. Both my wife and my doctor tell me that at my age this is a dangerous thing.
>>
>> I do understand the importance and necessity of Unicode. And I applaud the addition of Unicode functions.
>>
>> But the deprecated string functions have a long history. The English language and Latin-1 characters are widely used around the world.
>>
>> Yes, it should be easy for programmers to translate code from one user language to another. But I'm not convinced that the Gods of Erlang have found the optimal solution by dropping all Latin-1 string functions.

We have not said we will drop them, only deprecate them. They will stay with us for a long time.
I want to remove them from the docs, because the manual page becomes a monster, but we will see what happens with that idea.

>> My particular application is directed toward English speakers. So, until further notice, I have no use for Unicode.
>>
>> I don't want to sound like a nationalist pig, but I think dropping the Latin-1 string functions from future Erlang releases is a BIG mistake.
>>
>> I look up tokens/2, a function that I use fairly frequently, and I see that it's deprecated. I look up the suggested replacement and I see lexemes/2.
>>
>> So I ask, what the ... is a lexeme? I look it up in Merriam-Webster and I see that a lexeme is "a meaningful linguistic unit."
>>
>> Meaning what? I just want to turn "this and that" into "This And That."

I had to google a lot to come up with a replacement name for "tokens"; a lexeme is what you think a token is, or at least what I thought it was.

'tokens' is replaced by 'lexemes', which works exactly the same for ASCII lists as before, with one exception, [CR,LF], but more on that below.

>> I read further in the Erlang docs and I see "grapheme cluster." WHAT THE ... IS A GRAPHEME CLUSTER?

This is the best description that I found, and it is what I wrote in the manual page:
  A grapheme cluster is a user-perceived character, which can be represented by several codepoints.

I don't know if I can explain it better than that. It is the term used in the Unicode standard, and more information can be found there.
Below that line I have the self-explanatory example:

"å"  [229] or [97, 778]
"e̊"  [101, 778]

So in Swedish we have the user-perceived character å, which is an 'a' with a ring above;
it can be represented in Unicode with codepoint 229 (å) or with the two codepoints 97 (a) and 778 (combining ring above).

With that we can make "new" combined characters, as I tried with the 'e' and a ring above, which
Google Chrome does not render correctly for me; the ring should be placed directly above the 'e'.

This representation is important to avoid a "character" explosion for non-Latin-1 character sets such as those of Asian and Arabic languages.

You can change between the representations of 'å' with unicode:characters_to_nf[c|d]_list.
It is important that you normalize the data you get from the outside world to one representation before operating on it.
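
As an illustration of the two representations and of normalization, here is a sketch assuming Erlang/OTP 20 or later. The module name `nf_demo` and the map keys are made up for this example; the codepoints are the ones given above.

```erlang
-module(nf_demo).
-export([demo/0]).

demo() ->
    Composed = [229],        %% a single codepoint
    Decomposed = [97, 778],  %% 'a' followed by the combining mark
    #{%% As plain lists the two values are different terms.
      same_list => Composed =:= Decomposed,
      %% length/1 counts codepoints, string:length/1 counts grapheme clusters.
      codepoints => {length(Composed), length(Decomposed)},
      clusters => {string:length(Composed), string:length(Decomposed)},
      %% Normalizing converts between the two representations.
      nfc => unicode:characters_to_nfc_list(Decomposed),
      nfd => unicode:characters_to_nfd_list(Composed),
      %% string:equal/4 can normalize both sides before comparing.
      equal_after_nfc => string:equal(Composed, Decomposed, false, nfc)}.
```

The point of the sketch is that a plain `=:=` comparison treats the two forms as different strings, which is why normalizing input at the boundary of the system matters.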

However, the Unicode people also defined [CR,LF] as one grapheme cluster, which makes it impossible to extend the old
functions in a compatible way.

So to split a multi-line string into lines, where you previously did:
Lines = string:tokens(" a \n b \r\n c", "\r\n"),    %% Split line on CR or LF
You must now rewrite that to:
Lines = string:lexemes(" a \n b \r\n c", ["\r\n", $\n]),  %% Split line on CR,LF or LF

Blame the standard and not me :-)
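
The same line-splitting comparison as a compilable sketch, assuming Erlang/OTP 20 or later. The module name `lines_demo` and its function names are hypothetical. In the new call the separator list holds one two-codepoint cluster, "\r\n", and one single character, $\n.

```erlang
-module(lines_demo).
-export([old_lines/1, new_lines/1]).

%% Old API: every character in the separator string is a separator,
%% so CR and LF are each treated as a line break.
old_lines(Str) ->
    string:tokens(Str, "\r\n").

%% New API: separators are grapheme clusters, so the CR,LF pair has to
%% be listed as one separator next to the single LF.
new_lines(Str) ->
    string:lexemes(Str, ["\r\n", $\n]).
```

A lone CR is not listed as a separator in `new_lines/1`; add $\r to the list if the input may contain it.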

 

>> I look up "grapheme" in Merriam-Webster. Oh it is now all so clear: "a unit of a writing system."
>>
>> Ah yes, grapheme is defined in the docs. But I have to read and re-read the definition to understand what the Gods of Erlang mean by a "grapheme cluster." And I'm still not sure I get it.
>>
>> It sounds like someone took a linguistics class and is trying to show off.

As you can see in this email, my linguistic knowledge is way worse than yours, so maybe you can help improve the manual.
But it is tough to describe Unicode handling in an easy way.
When a character is not a character anymore, it becomes fuzzy fast.

Everything here is off the top of my head and not tested; it's too late for that here and I have a Zelda boss to beat. My kids
are way ahead of me.

BR
/Dan
 

 


Re: Strings - deprecated functions

Lloyd R. Prentice-2

Hi All,

There's hardly a point in the responses so far that I could disagree with or argue against. Although I still contend that the name "lexeme" is about as ugly a term as I could imagine.

So a challenge:

Can someone please write a short tutorial that shows, on a one-to-one basis, how to use the Unicode functions to replace the Latin-1 functions, then provide a link to it from the string docs? Or, better yet, simply integrate it into the string docs.
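
As a starting point for such a tutorial, here is a hedged sketch that pairs a few old calls with the replacements the string docs suggest and checks that they agree on one ASCII sample, assuming Erlang/OTP 20 or later. The module `migrate_demo`, its function names and the sample string are invented for illustration, and the list of pairs is far from complete.

```erlang
-module(migrate_demo).
-export([pairs/0, all_agree/0]).

%% Each tuple is {Mapping, OldResult, NewResult} for the same input.
pairs() ->
    S = "  Hello World  ",
    [{"len -> length", string:len(S), string:length(S)},
     {"to_upper -> uppercase", string:to_upper(S), string:uppercase(S)},
     {"to_lower -> lowercase", string:to_lower(S), string:lowercase(S)},
     {"strip -> trim", string:strip(S), string:trim(S)},
     %% substr/3 counts positions from 1, slice/3 counts from 0.
     {"substr -> slice", string:substr(S, 3, 5), string:slice(S, 2, 5)},
     {"sub_word -> nth_lexeme",
      string:sub_word(S, 2), string:nth_lexeme(S, 2, " ")},
     %% Old: index lookup plus substr. New: find/2 returns the rest directly.
     {"str -> find",
      string:substr(S, string:str(S, "World")), string:find(S, "World")}].

%% True when every old/new pair produced the same result.
all_agree() ->
    lists:all(fun({_Mapping, Old, New}) -> Old =:= New end, pairs()).
```

The pairs only agree like this for flat ASCII lists. Note also that trim/1 removes all leading and trailing whitespace, whereas strip/1 removes only spaces.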

 

Nevertheless, I'm still deeply troubled.

Many thanks to all.

Lloyd

-----Original Message-----
From: "Dan Gudmundsson" <[hidden email]>
Sent: Wednesday, November 22, 2017 4:45pm
To: "Grzegorz Junka" <[hidden email]>
Cc: [hidden email]
Subject: Re: [erlang-questions] Strings - deprecated functions



On Wed, Nov 22, 2017 at 8:59 PM Grzegorz Junka <[hidden email]> wrote:

Dear Lloyd,

Isn't this more about documentation than the code? What I am reading is that you want to keep the old functions because you don't understand how the new functions work. Shouldn't you rather ask for a more clear documentation? Is there anything in the old functions that is not supported in the new functions?

GrzegorzJ


On 22/11/2017 19:43, [hidden email] wrote:

Dear Gods of Erlang,

 

"This module has been reworked in Erlang/OTP 20 to handle unicode:chardata() and operate on grapheme clusters. The old functions that only work on Latin-1 lists as input are still available but should not be used. They will be deprecated in Erlang/OTP 21."

 

The new functions also works on binaries (or even unicode:chardata() which basically are io-lists but with unicode support) 
which is in many cases a better representation of strings.
That means you can append two long strings with ["long string 1..", "long string 2.."] 
and the string will be "flattened" when output to a file or socket.
So you can see the string module as an introduction how you should handle strings in erlang efficiently. :-)
Though, that said, many of the new functions are slower then the functionality the are replacing, 
optimizations for the ASCII/Latin-1 case are being worked on.
I changed the api rather drastically I know that.
The reason was that in the old 'C' inspired API, you searched first and returned an index,
 then split the string on that index.
Since handling unicode, grapheme clusters, binaries and deep lists of "characters" increases the cost of traversing
the input string, I went with a new API which combines the two calls into one. e.g. find() or take() which searches the string
and returns the result directly, to avoid the extra traversal.

I'm sorry. I've brought up this issue before and got lots of push back.

 

But every time I look up tried and true and long-used string functions to find that they are deprecated and will be dropped in future Erlang releases my blood pressure soars. Both my wife and my doctor tell me that at my age this is a dangerous thing.

 

I do understand the importance and necessity of Unicode. And applaud the addition of Unicode functions.

 

But the deprecated string functions have a long history. The English language and Latin-1 characters are widely used around the world. 

 

Yes, it should be easy for programmers to translate code from one user language to another. But I'm not convinced that the Gods of Erlang have found the optimal solution by dropping all Latin-1 string functions.

We have not said we will drop them, only deprecate them. They will stay with us for a long time.
I want to remove them from the docs, because the manual page becomes a monster, but we will see what happens with that idea.
 

 

My particular application is directed toward English speakers. So, until further notice, I have no use for Unicode.

 

I don't want to sound like nationalist pig, but I think dropping the Latin-1 string functions from future Erlang releases is a BIG mistake.

 

I look up tokens/2, a function that I use fairly frequently, and I see that it's deprecated. I look up the suggested replacement and I see lexemes/2.

 

So I ask, what the ... is a lexeme? I look it up in Merriam-Webster and I see that a lexeme is  "a meaningful linguistic unit." 

 

Meaning what? I just want to turn "this and that" into "This And That."

I had to google a lot to come up with a replacement name for "tokens"; a lexeme is what you think a token is, or at least what I thought it was.
 
'tokens' is replaced by 'lexemes', which works exactly the same as before for ASCII lists, with one exception, [CR,LF]; more on that
below.

 

 

I read further in the Erlang docs and I see "grapheme cluster."  WHAT THE ... IS GRAPHEME CLUSTER?

This is the best description that I found, and it is what I wrote in the manual page:
  A grapheme cluster is a user-perceived character, which can be represented by several codepoints.
I don't know if I can explain it better than that; it is the term used in the Unicode standard, and more information can be found there.
Below that line I have the self-explanatory example:

"å" [229] or [97, 778] "e̊" [101, 778]

So in Swedish we have the user-perceived character å, which is an 'a' with a ring above;
it can be represented in unicode with codepoint 229 (å) or with the two codepoints 97 (a) and 778 (combining ring above).
With that we can make "new" combined characters, as I tried with the 'e' and a ring above, which
Google Chrome does not render correctly for me; the ring should be placed directly above the 'e'.
This representation is important to avoid a "character" explosion for non-Latin-1 character sets such as those of Asian and Arabic languages.
You can change between the representations of 'å' with unicode:characters_to_nf[c|d]_list.
It is important that you normalize the data you get from the outside world to one representation before operating on it.
But what they (the Unicode people) also did was define [CR,LF] as one grapheme cluster, which makes it impossible to use or extend the old
functions in a compatible way.
So to split a multi-line string into lines, where you previously did:
Lines = string:tokens(" a \n b \r\n c", "\r\n"),    %% Split lines on CR or LF
You must now rewrite that to:
Lines = string:lexemes(" a \n b \r\n c", ["\r\n",$\n]),  %% Split lines on CR,LF or LF
Blame the standard and not me :-)
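The å example and the CR,LF case can be sketched as follows. The module name is hypothetical and the assertions are only there to make the claims checkable.

```erlang
-module(grapheme_sketch).
-export([demo/0]).

demo() ->
    Composed   = [229],        %% å as a single codepoint
    Decomposed = [97, 778],    %% a followed by a combining ring above
    %% Different codepoints, but one user-perceived character each.
    false = Composed =:= Decomposed,
    1 = string:length(Composed),
    1 = string:length(Decomposed),
    %% Normalize outside data to one form before operating on it.
    Composed   = unicode:characters_to_nfc_list(Decomposed),
    Decomposed = unicode:characters_to_nfd_list(Composed),
    %% CR,LF is one grapheme cluster, so it is listed as its own separator.
    string:lexemes(" a \n b \r\n c", ["\r\n", $\n]).
```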

 

I look up "grapheme" in Merriam-Webster. Oh it is now all so clear: "a unit of a writing system."

 

Ah yes, grapheme is defined in the docs. But I have to read and re-read the definition to understand what the Gods of Erlang mean by a "grapheme cluster." And I'm still not sure I get it.

 

It sounds like someone took a linguistics class and is trying to show off.

As you can see in this email, my linguistic knowledge is way worse than yours, so maybe you can help improve the manual.
But it is tough to describe unicode handling in an easy way.
When a character is not a character anymore, it becomes fuzzy fast.
Everything here is off the top of my head and not tested; it's too late for that here and I have a Zelda boss to beat. My kids
are way ahead of me.
BR
/Dan
 

 

But now I've spent 30 minutes--- time that I don't have to waste trying to figure out how do a simple manipulation of "this and that." Recurse the next time I want to look up a string function in the Erlang docs.

 

SOLUTION

 

Keep the Latin-1 string functions. Put them in a separate library if necessary. Or put the new Unicode functions in a separate library. But don't arbitrarily drop them.

 

Some folks have suggested that I maintain my own library of the deprecated Latin1 functions. But why should I have to do that? How does that help other folks with the same issue?

 

Bottom line: please please please do not drop the existing Latin-1 string functions.

 

Please don't.

 

Best wishes,

 

LRP

_______________________________________________ erlang-questions mailing list [hidden email] http://erlang.org/mailman/listinfo/erlang-questions


Re: Strings - deprecated functions

Tristan Sloughter-4
Can someone please write a short tutorial that shows on a one-to-one basis how to use Unicode functions to replace the Latin-1 functions then provide a link to it from the string docs. Or, better yet, simply integrate it into the string docs.

In your first email you already mentioned the fact that the docs do provide a one-to-one mapping, it links directly to the new function to use from the doc of the deprecated function. Like in your case of tokens to lexemes the same arguments work and the functions have examples, so what is missing from the docs?


Re: Strings - deprecated functions

Lloyd R. Prentice-2

Hi Tristan,

 

When I have a few moments of free time, I'll go through the string docs carefully and try to point out all the issues that I find confusing or obscure.

 

Maybe I'm the only person in the world with issues here. If so, I'll shut up and try to get with the program. But I can't tell you how much time I've wasted trying to grasp how and when to use various functions in the docs--- and certainly not only the string documentation.

 

I've been trying for more than three years now to master Erlang sufficiently well to build my current project. The canonical books have made it possible (thanks, Joe, et al.). But I'm an army of one. I can't turn to the programmer next to me or my programming supervisor to ask what's what.

 

This list is a godsend and the folks on it are extraordinarily gracious and generous. But for all that, there are still major libraries and functions that I've looked at thinking, hey, that might be useful. But I can't figure out for the life of me how and when to use them.

 

If we see reason to leave these obscure functions in, why can't we leave the Latin-1 functions? Even if only in name as a wrapper around the Unicode functions that deliver the same functionality?

 

When I launch my current project, I'll be happy to dig in deep and do what I can to help improve the docs. The best that I can offer is my profound ignorance and willingness to ask Micky the Dunce questions.

 

One Erlanger pointed out that they would rather have maintainers work on bug fixes and new features than documentation. I'd argue that without clear and inviting documentation we discourage adopters and cripple the vitality of our community.

 

I'm eager to do what I can to improve the documentation. But I can't spend much time at it until I get paid. And I don't get paid until I've launched my current project.

 

All the best,

 

Lloyd

 

 



Re: Strings - deprecated functions

Richard A. O'Keefe-2
In reply to this post by Lloyd R. Prentice-2
[hidden email] wrote:
> I read further in the Erlang docs and I see "grapheme cluster."  WHAT THE
> ... IS GRAPHEME CLUSTER?
>
> I look up "grapheme" in Merriam-Webster. Oh it is now all so clear: "a
> unit of a writing system."

"Grapheme" and "grapheme cluster" are technical terms in Unicode.
The best place to look is probably UAX#29 Unicode Text Segmentation
http://unicode.org/reports/tr29/
Section 3 begins with this paragraph, which should help:
  It is important to recognize that what the user thinks of
  as a “character”—a basic unit of a writing system for a
  language—may not be just a single Unicode code point.
  Instead, that basic unit may be made up of multiple Unicode
  code points.  To avoid ambiguity with the computer use of the
  term character, this is called a user-perceived character.
  For example, “G” + acute-accent is a user-perceived character:
  users think of it as a single character, yet is actually
  represented by two Unicode code points.  These user-perceived
  characters are approximated by what is called a grapheme cluster,
  which can be determined programmatically.

> It sounds like someone took a linguistics class and is trying to show off.

It would be pretty horrifying if many of the people defining Unicode
hadn't taken a linguistics class or three...  It's actually a very
obvious practical problem:  suppose you are in your favorite editor
and press the "move forward 1 character" key.  The distinction between
Unicode and UTF-8 makes it sufficiently clear that this doesn't mean
"move forward one byte" (C), and the distinction between Unicode and
UTF-16 makes it sufficiently clear that it doesn't mean "move forward
one 16-bit char" (Java).  But it doesn't mean "move forward one Unicode
code point" either.  There is no limit in principle to the number of
code points in a user-perceived character.  Figuring out just how many
code points in a "character" (= grapheme cluster) is sufficiently
tricky that you do not want to do it yourself.
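To make the byte / code point / grapheme cluster distinction concrete, here is a small sketch. The module and function names are invented for illustration.

```erlang
-module(count_sketch).
-export([sizes/1]).

%% Returns {Bytes, CodePoints, GraphemeClusters} for a UTF-8 binary.
sizes(Bin) when is_binary(Bin) ->
    {byte_size(Bin),                            %% what C would step over
     length(unicode:characters_to_list(Bin)),   %% Unicode code points
     string:length(Bin)}.                       %% user-perceived characters
```

For 'e' followed by a combining ring above (code points 101 and 778), written as <<101/utf8, 778/utf8>>, this gives three bytes, two code points and one grapheme cluster.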

The text *I* generate is almost exclusively Latin-1, but it is less
and less common for me to *get* data in that form.  I too would like
full retention of Latin-1 support >>for data I am fully in control of<<.




Re: Strings - deprecated functions

Fred Hebert-2
In reply to this post by Lloyd R. Prentice-2
On 11/22, [hidden email] wrote:
>I read further in the Erlang docs and I see "grapheme cluster."  WHAT
>THE ... IS GRAPHEME CLUSTER?
>

A quick run-through. In ASCII and latin-1 you mostly can deal with the
following words, which are all synonymous:

- character
- letter
- symbol

In some variants, you also have to add the word 'diacritic' or 'accent',
which lets you modify a character in linguistic terms:

a + ` = à

Fortunately, in latin1, most of these naughty diacritics have been
bundled into specific characters. In French, for example, this final
'letter' can be represented under a single code (224).

There are, however, complications coming from that. One of them is
'collation' (the sorting order of letters). For example, a and à in
French ought to sort in the same portion of the alphabet (before 'b'),
but by default à ends up sorting after 'z'.

In Danish, A is the first letter of the alphabet, but Å is last. Also Å
is seen as a ligature of Aa; Aa is sorted like Å rather than two letters
'Aa' one after the other. Swedish has different diacritics with
different orders: Å, Ä, Ö.

So uh, currently, Erlang did not even do a great job at Latin-1 because
there was nothing to handle 'collations' (string comparisons to know
what is equal or not).


Enter UNICODE. To make a long story short (and I hope Mr. ROK won't be too
mad at my gross oversimplifications), we have the following terms in
the vocabulary:

- character: smallest representable unit in a language, in the abstract.  
  '`' is a character, so is 'a', and so is 'à'
- glyph: the visual representation of a character. Think of it as a
  character from the point of view of the font or typeface designer. For
  example, the same glyph may be used for the capital letter 'pi' and
  the mathematical symbol for a product: ∏. Similarly, capital 'Sigma'
  and the mathematical 'sum' may have different character
  representation, but the same ∑ glyph.
- letter: an element of an alphabet
- codepoint: A given value in the unicode space. There's a big table
  with a crapload of characters in them, and every character is assigned
  a codepoint, as a unique identifier for it.
- code unit: a specific encoding of a given code point. This refers to
  bits, not just the big table. The same code point may have different
  code units in UTF-8, UTF-16, and UTF-32, which are 3 'encodings' of
  unicode.
- grapheme: what the user thinks of as a 'character'
- grapheme cluster: what you want to think of as a 'character' for your
  user's sake. Basically, 'a' and '`' can be two graphemes, but if I
  combine them together as 'à', I want to be able to say that a single
  'delete' key press will remove both the '`' and the 'a' at once from
  my text, and not be left with one or the other.

We're left with the word 'lexeme' which is not really defined in the
unicode glossary. Linguists will treat it as a lexical unit (word or
term of vocabulary). In computer talk, you'd just define it as an
arbitrary string, or maybe token (it appears some people use them
interchangeably).

The big fun bit is that unicode takes all these really shitty
complicated linguistic things and specifies how they should be handled.

Like, what makes two strings equal? I understand it's of little
importance in English, but the french 'é' can be represented both as a
single é or as e+´. It would be good, when you deal with say JSON or
maybe my username, that you don't end up having 'Frédéric' as 4
different people depending on which form was used. JSON, by the way,
specifies 'unicode' as an encoding!

In any case, these equivalence rules are specified in normalization forms
(http://unicode.org/reports/tr15/). The new interface lets you compare
strings with 'string:equal(A, B, IgnoreCase, none | nfc | nfd | nfkc | nfkd)',
which is good, because the rules for changing case are also language- or
alphabet-specific.
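As a sketch of the 'Frédéric' problem: string:equal/4 takes a normalization form as its last argument, so the two spellings of 'é' can be made to compare equal. The module name and the helper same_name/2 are hypothetical.

```erlang
-module(equal_sketch).
-export([same_name/2]).

%% Compares two names ignoring case, after normalizing both to NFC.
same_name(A, B) ->
    string:equal(A, B, true, nfc).
```

With "Fr" ++ [233] ++ "d" (a single é) and "Fr" ++ [101, 769] ++ "d" (e plus a combining acute accent), same_name/2 returns true, while string:equal/2 on the same two lists returns false because no normalization is applied.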

So when you look at functions like 'string:next_grapheme/1' and
'string:next_codepoint/1', they're related to whether you want to
consume the data in terms of user-observable 'characters' or in terms of
unicode-specific 'characters'. Because they're not the same, and
depending on what you want to do, this is important.

You could call 'string:to_graphemes' and get an iterable list the way
you could use them before:

1> string:to_graphemes("ß↑e̊").
[223,8593,[101,778]]
2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
[223,8593,[101,778]]

But now it's working regardless of the initial format! This is really
freaking cool.
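The difference between string:next_grapheme/1 and string:next_codepoint/1 can be seen on the same input. This is only a sketch with an invented module name; it assumes a non-empty string, since both functions return [] at the end of the input.

```erlang
-module(walk_sketch).
-export([first/1]).

%% Returns {FirstGraphemeCluster, FirstCodepoint} of a non-empty string.
first(Str) ->
    [Grapheme | _]  = string:next_grapheme(Str),
    [Codepoint | _] = string:next_codepoint(Str),
    {Grapheme, Codepoint}.
```

For [101, 778, $x] the first grapheme cluster is the pair [101,778], while the first codepoint is just 101.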

>SOLUTION
>

Translation!

centre/2-3 ==> pad/2-4
    Same thing, except pad is more generic and accepts a direction
chars/2    ==> lists:duplicate/2
    Same thing, except the 2 arguments are flipped.
chars/3    ==> ???
    No direct match, but just call [lists:duplicate(N, Elem)|Tail] to
    get an equivalence
chr/2      ==> find/2-3 (with 3rd argument 'leading')
    whereas chr/2 returns a position, find/2-3 returns the string after
    the match. This leaves a bit of a gap if you're looking to take
    everything *until* a given character (look at take/3-4 if you need a
    single character, or maybe string:split/2-3), or really the
    position, but in Unicode the concept of a position is vague: is it
    based on code units, codepoints, grapheme clusters, or what?
concat/2   ==> ???
    You can concatenate strings by using iolists: [A,B]. If you need to
    flatten the string, use unicode:characters_to_[list|binary].
copies/2   ==> lists:duplicate/2
    Same thing, except the two arguments are flipped
cspan/2    ==> take/3-4
    specifically, cspan(Str, Chars) corresponds to take(Str, Chars,
    true, leading). Returns a pair of {Before, After} strings rather
    than a length.
join/2     ==> lists:join/2
    same thing, but the arguments are flipped
left/2-3   ==> pad/2-4
    same thing, except pad is more generic and accepts a direction
len/1      ==> length/1
    returns grapheme cluster counts rather than 'characters'.
rchr/2     ==> find/2-3 (with 3rd argument 'trailing')
    see chr/2 conversion for description.
right/2-3  ==> pad/2-4
    same as center/2-3 and left/2-3.
rstr/2     ==> find/3
    use 'trailing' as third argument for similar semantics. Drops
    characters before the match and returns the leftover string rather
    than just an index. A bit of a gap if you want the opposite, maybe
    use string:split/2-3
span/2     ==> take/2
    no modifications required for arguments, but take/2 returns a
    {Before, After} pair of strings rather than a length.
str/2      ==> find/2
    'leading' is the default direction. Returns the string starting at
    the match rather than just an index. Maybe string:split/2-3 is nicer there?
strip/1-3  ==> trim/1-3
    Same, aside from the direction. strip/2-3 accepted 'left | right |
    both' whereas trim/2-3 accepts 'leading | trailing | both'. Be
    careful. Oh also strip/3 takes a single character as an argument and
    trim/3 takes a list of characters.
sub_string/2-3 ==> slice/2-3
    (String, Start, Stop) is changed for (String, Start, Length). This
    reflects the idea that grapheme clusters make it a lot harder to
    know where in a string a specific position is. The length is in
    grapheme clusters.
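substr/2-3 ==> slice/2-3
    same (String, Start, Length) arguments, except that slice/2-3 counts
    Start from 0 where substr/2-3 counted from 1.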
sub_word/2-3 ==> nth_lexeme/3
    Same, except rather than a single character in the last position, it
    now takes a list of separators (grapheme clusters). So ".e" is
    actually the list of [$., $e], two distinct separators.
to_lower/1 ==> lowercase/1
    Same
to_upper/1 ==> uppercase/1
    Same
tokens/2   ==> lexemes/2
    Same
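words/2    ==> lexemes/2
    Same splitting, but lexemes/2 accepts a list of 'characters' (grapheme
    clusters) instead of a single one of them, and it returns the list of
    words where words/2 returned only their count (take its length/1).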
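A few of the translations above, written out as a sketch. The module name, the sample strings and the helper title_words/1 (the "This And That" case from the start of the thread) are illustrative only; the old calls appear in comments so the module compiles without deprecation warnings.

```erlang
-module(migrate_sketch).
-export([demo/0, title_words/1]).

demo() ->
    %% string:tokens("this and that", " ")
    ["this", "and", "that"] = string:lexemes("this and that", " "),
    %% string:to_upper("this")
    "THIS" = string:uppercase("this"),
    %% string:strip("--abc--", both, $-): trim/3 takes a list of separators
    "abc" = string:trim("--abc--", both, "-"),
    %% string:join(["a","b","c"], ","): arguments flipped, result not flat
    "a,b,c" = lists:flatten(lists:join(",", ["a", "b", "c"])),
    %% string:span("123abc", "0123456789") returned 3
    {"123", "abc"} = string:take("123abc", "0123456789"),
    %% string:cspan("abc123", "0123456789") returned 3
    {"abc", "123"} = string:take("abc123", "0123456789", true),
    %% string:substr("Hello World", 4, 5) counted from 1; slice/3 counts from 0
    "lo Wo" = string:slice("Hello World", 3, 5),
    ok.

%% Upper-cases the first grapheme cluster of every space-separated word.
title_words(Str) ->
    Words = [string:titlecase(W) || W <- string:lexemes(Str, " ")],
    lists:flatten(lists:join(" ", Words)).
```

Note that title_words/1 collapses runs of spaces, because lexemes/2 never returns empty words.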


The biggest annoyance I have had converting so far was handling
find/2-3; in a lot of places in code, I had patterns where the objective
was to drop what happened *after* a given character, and the function
does just the opposite. You can take a look at string:split/2-3 there.
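For the "keep what comes before the separator" case, a minimal sketch with string:split/2 could look like this. The function name before/2 is made up.

```erlang
-module(before_sketch).
-export([before/2]).

%% Returns the part of Str before the first occurrence of Sep,
%% or all of Str when Sep does not occur.
before(Str, Sep) ->
    hd(string:split(Str, Sep)).
```

string:split/2 splits only at the first match, so before("key=value=x", "=") gives "key", whereas string:find("key=value=x", "=") gives "=value=x".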

The second biggest annoyance is making sure that functions that used to
take just a single character now may take more than one of them. It
makes compatibility a bit weird.

>Keep the Latin-1 string functions. Put them in a separate library if
>necessary. Or put the new Unicode functions in a separate library. But
>don't arbitrarily drop them.
>
>Some folks have suggested that I maintain my own library of the
>deprecated Latin1 functions. But why should I have to do that? How does
>that help other folks with the same issue?
>

I had a few problems with it myself; I just finished updating rebar3 and
dependencies to run on both >OTP-21 releases and stuff dating back to
OTP-16. The problem we have is that we run on a backwards compat
schedule that is stricter and longer than the OTP team.

For example, string:join/2 is being replaced with lists:join/2, but
lists:join/2 did not exist in R16 and string:join/2 is deprecated in
OTP-21. So we needed to extract that function from OTP into custom
modules everywhere, and replace old usage with new one.

I was forced to add translation functions and modules like
https://github.com/erlang/rebar3/blob/master/src/rebar_string.erl to the
code base, along with this conditional define: https://github.com/erlang/rebar3/blob/master/rebar.config#L33
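The general shape of such a translation module can be sketched as below. This is not the rebar3 code: the module name compat_string and the macro name unicode_str are assumptions for illustration, and the macro would have to be defined by the build on OTP versions that have lists:join/2 (for example with {d, unicode_str} in erl_opts).

```erlang
-module(compat_string).
-export([join/2]).

%% Same argument order as the old string:join/2, flat string as result.
-ifdef(unicode_str).
%% Newer OTP: lists:join/2 exists; flatten to match the old return value.
join(Strings, Sep) -> lists:flatten(lists:join(Sep, Strings)).
-else.
%% Older OTP: fall back to the function that is being deprecated.
join(Strings, Sep) -> string:join(Strings, Sep).
-endif.
```

Callers then use compat_string:join/2 everywhere, and only this one module has to know which OTP release it was compiled for.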

It's a bit painful, but it ends up working quite alright. Frankly it's
nicer if you can adopt OTP's deprecation pace, it's quite painful for us
being on a larger sequence.

>Bottom line: please please please do not drop the existing Latin-1
>string functions.
>
>Please don't.
>
>Best wishes,
>
>LRP
>

It's probably alright if they keep warnings for a good period of time
before dropping everything. OTP-21 starts warning, so people who want to
keep 'warnings_as_errors' as an option will suffer the most.

But overall, you can't escape Unicode. Work with JSON? There it is.  
HTTP? There again. URLs? You bet. File systems? Hell yes! Erlang modules
and app files: also yes!

The one place I've seen a deliberate attempt at not doing it was with
host names in certificate validation (or DNS), since there, making
distinct domain names compare the same could be an attack vector. There
you get to deal with the magic of punycode
(https://en.wikipedia.org/wiki/Punycode) if you want to be safer.

- Fred.

Re: Strings - deprecated functions

Lloyd R. Prentice-2
A big hearty thanks to ok and Fred for the terrific clarifications.

I guess I'll just have to suck it up and convert all Latin-1 functions that I've written so far to Unicode functions. If I wait until later I may not be among the living. Hate to foist that off on some unsuspecting soul.

Meanwhile, I've just pushed my release target out another month (or two).

And thanks to all.

Lloyd

Sent from my iPad

> On Nov 22, 2017, at 9:45 PM, Fred Hebert <[hidden email]> wrote:
>
>> On 11/22, [hidden email] wrote:
>> I read further in the Erlang docs and I see "grapheme cluster."  WHAT THE ... IS GRAPHEME CLUSTER?
>>
>
> A quick run-through. In ASCII and latin-1 you mostly can deal with the following words, which are all synonymous:
>
> - character
> - letter
> - symbol
>
> In some variants, you also have to add the word 'diacritic' or 'accent' which let you modify a character in terms of linguistincs:
>
> a + ` = à
>
> Fortunately, in latin1, most of these naughty diacritics have been bundled into specific characters. In French, for example, this final 'letter' can be represented under a single code (224).
>
> There are however complications coming from that. One of them is 'collation' (the sorting order of letters). For example, a and à in French ought to sort in the same portion of the alphabet (before 'b'), but by default, they end up sorting after 'z'.
>
> In Danish, A is the first letter of the alphabet, but Å is last. Also Å is seen as a ligature of Aa; Aa is sorted like Å rather than two letters 'Aa' one after the other. Swedish has different diacritics with different orders: Å, Ä, Ö.
>
> So uh, currently, Erlang did not even do a great job at Latin-1 because there was nothing to handle 'collations' (string comparisons to know what is equal or not).
>
>
> Enter UNICODE. To make a matter short (and I hope Mr. ROK won't be too mad at my gross oversimplifications), we have the following terms in vocabulary:
>
> - character: smallest representable unit in a language, in the abstract.   '`' is a character, so is 'a', and so is 'à'
> - glyph: the visual representation of a character. Think of it as a  character from the point of view of the font or typeface designer. For  example, the same glyph may be used for the capital letter 'pi' and  the mathematical symbol for a product: ∏. Similarly, capital 'Sigma'  and the mathematical 'sum' may have different character  representation, but the same ∑ glyph.
> - letter: an element of an alphabet
> - codepoint: A given value in the unicode space. There's a big table  with a crapload of characters in them, and every character is assigned  a codepoint, as a unique identifier for it.
> - code unit: a specific encoding of a given code point. This refers to  bits, not just the big table. The same code point may have different  code units in UTF-8, UTF-16, and UTF-32, which are 3 'encodings' of  unicode.
> - grapheme: what the user thinks of a 'character'
> - grapheme cluster: what you want to think of as a 'character' for your  user's sake. Basically, 'a' and '`' can be two graphemes, but if I  combine them together as 'à', I want to be able to say that a single  'delete' key press will remove both the '`' and the 'a' at once from  my text, and not be left with one or the other.
>
> We're left with the word 'lexeme' which is not really defined in the unicode glossary. Linguists will treat it as a lexical unit (word or term of vocabulary). In computer talk, you'd just define it as an arbitrary string, or maybe token (it appears some people use them interchangeably).
>
> The big fun bit is that unicode takes all these really shitty complicated linguistic things and specifies how they should be handled.
>
> Like, what makes two strings equal? I understand it's of little importance in English, but the french 'é' can be represented both as a single é or as e+´. It would be good, when you deal with say JSON or maybe my username, that you don't end up having 'Frédéric' as 4 different people depending on which form was used. JSON, by the way, specifies 'unicode' as an encoding!
>
> In any case, these encoding rules are specified in normalization forms (http://unicode.org/reports/tr15/). The new interface lets you compare string with 'string:compare(A, B, IgnoreCase, nfc | nfk | nfkc | nfkd)' which is good, because the rules for changing case are also language- or alphabet-specific.
>
> So when you look at functions like 'string:next_grapheme/1' and 'string:next_codepoint/1', they're related to whether you want to consume the data in terms of user-observable 'characters' or in terms of unicode-specific 'characters'. Because they're not the same, and depending on what you want to do, this is important.
>
> You could call 'string:to_graphemes' and get an iterable list the way you could use them before:
>
> 1> string:to_graphemes("ß↑e̊").
> [223,8593,[101,778]]
> 2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
> [223,8593,[101,778]]
>
> But now it's working regardless of the initial format! This is really freaking cool.
>
>> SOLUTION
>>
>
> Translation!
>
> centre/2-3 ==> pad/2-4
>   Same thing, except pad is more generic and accepts a direction
> chars/2    ==> lists:duplicate/2
>   Same thing, except the 2 arguments are flipped.
> chars/3    ==> ???
>   No direct match, but just call [lists:duplicate(N, Elem)|Tail] to    get an equivalence
> chr/2      ==> find/2-3 (with 3rd argument 'leading')
>   whereas chr/2 returns a position, find/2-3 returns the string after    the match. This leaves a bit of a gap if you're looking to take    everyting *until* a given character (look at take/3-4 if you need a    single character, or maybe string:split/2-3), or really the    position, but in Unicode the concept of a position is vague: is it    based on code units, codepoints, grapheme clusters, or what?
> concat/2   ==> ???
>   You can concatenate strings by using iolists: [A,B]. If you need to    flatten the string with unicode:character_to_[list|binary].
> copies/2   ==> lists:duplicate/2
>   Same thing, except the two arguments are flipped
> cspan/2    ==> take/3-4
>   specifically, cspan(Str, Chars) is equivalent to take(Str, Chars,    false, leading). Returns a pair of {Before, After} strings rather    than a length.
> join/2     ==> lists:join/2
>   same thing, but the arguments are flipped
> left/2-3   ==> pad/2-4
>   same thing, except pad is more generic and accepts a direction
> len/1      ==> length/1
>   returns grapheme cluster counts rather than 'characters'.
> rchr/2     ==> find/2-3 (with 3rd argument 'trailing')
>   see chr/2 conversion for description.
> right/2-3  ==> pad/2-4
>   same as center/2-3 and left/2-3.
> rstr/2     ==> find/3
>   use 'trailing' as third argument for similar semantics. Drops    characters before the match and returns the leftover string rather    than just an index. A bit of a gap if you want the opposite, maybe    use string:split/2-3
> span/2     ==> take/2
>   no modifications required for arguments, but take/2 returns a    {Before, After} pair of strings rather than a length.
> str/2      ==> find/2
>   use 'leading' as a third argument. Drops characters before the match    rather than just an index. Maybe string:split/2-3 is nicer there?
> strip/1-3  ==> trim/1-3
>   Same, aside from the direction. strip/2-3 accepted 'left | right |    both' whereas trim/2-3 accepts 'leading | trailing | both'. Be    careful. Oh also strip/3 takes a single character as an argument and    trim/3 takes a list of characters.
> sub_string/2-3 ==> slice/2-3
>   (String, Start, Stop) is changed for (String, Start, Length). This    reflects the idea that grapheme clusters make it a lot harder to    know where in a string a specific position is. The length is in    grapheme clusters.
> substr/2-3 ==> slice/2-3
>   no change
> sub_word/2-3 ==> nth_lexeme/3
>   Same, except rather than a single character in the last position, it    now takes a list of separators (grapheme clusters). So ".e" is    actually the list of [$., $e], two distinct separators.
> to_lower/1 ==> lowercase/1
>   Same
> to_upper/1 ==> uppercase/1
>   Same
> tokens/2   ==> lexemes/2
>   Same
> words/2    ==> lexemes/2
>   Same, but lexemes/2 accepts a list of 'characters' (grapheme    clusters) instead of a single one of them.
>
>
> The biggest annoyance I have had converting so far was handling find/2-3; in a lot of places in code, I had patterns where the objective was to drop what happend *after* a given character, and the function does just the opposite. You can take a look at string:split/2-3 there.
>
> The second biggest annoyance is making sure that functions that used to take just a single character now may take more than one of them. It makes compatibility a bit weird.
>
>> Keep the Latin-1 string functions. Put them in a separate library if necessary. Or put the new Unicode functions in a separate library. But don't arbitrarily drop them.
>>
>> Some folks have suggested that I maintain my own library of the deprecated Latin1 functions. But why should I have to do that? How does that help other folks with the same issue?
>>
>
> I had a few problems with it myself; I just finished updating rebar3 and dependencies to run on both >OTP-21 releases and stuff dating back to OTP-16. The problem we have is that we run on a backwards compat schedule that is stricter and longer than the OTP team.
>
> For example, string:join/2 is being replaced with lists:join/2, but lists:join/2 did not exist in R16 and string:join/2 is deprecated in OTP-21. So we needed to extract that function from OTP into custom modules everywhere, and replace old usage with new one.
>
> I was forced to add translation functions and modules like https://github.com/erlang/rebar3/blob/master/src/rebar_string.erl to the code base, along with this conditional define: https://github.com/erlang/rebar3/blob/master/rebar.config#L33
>
> It's a bit painful, but it ends up working quite alright. Frankly it's nicer if you can adopt OTP's deprecation pace, it's quite painful for us being on a larger sequence.
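
A translation module of the kind described above might look roughly like the following. It is only a sketch, not the actual rebar_string code: the module name compat_string_sketch and the macro name no_lists_join are assumptions, and the macro would have to be defined by the build for old releases (for example through a conditional define in the build configuration).

```erlang
-module(compat_string_sketch).
-export([join/2]).

-ifdef(no_lists_join).
%% Releases without lists:join/2: concatenate by hand, inserting the
%% separator in front of every element except the first.
join([], _Sep) -> [];
join([H | T], Sep) -> H ++ lists:append([Sep ++ X || X <- T]).
-else.
%% Newer releases: lists:join/2 interleaves the separator and returns
%% a list of strings, which lists:append/1 concatenates.
join(Strings, Sep) -> lists:append(lists:join(Sep, Strings)).
-endif.
```

Callers then use compat_string_sketch:join(["a","b","c"], ", ") everywhere and never touch string:join/2 or lists:join/2 directly, so the deprecation only has to be handled in one place. Note that the argument order matches the old string:join/2 (list first, separator second), whereas lists:join/2 takes the separator first.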
>
>> Bottom line: please please please do not drop the existing Latin-1 string functions.
>>
>> Please don't.
>>
>> Best wishes,
>>
>> LRP
>>
>
> It's probably alright if they keep warnings for a good period of time before dropping everything. OTP-21 starts warning so people who want to keep 'warnings_as_errors' as an option will suffer the most.
>
> But overall, you can't escape Unicode. Work with JSON? There it is.  HTTP? There again. URLs? You bet. File systems? Hell yes! Erlang modules and app files: also yes!
>
> The one place I've seen a deliberate attempt at not doing it was with host names in certificate validation (or DNS), since there, making distinct domain names compare the same could be an attack vector. There you get to deal with the magic of punycode (https://en.wikipedia.org/wiki/Punycode) if you want to be safer.
>
> - Fred.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Strings - deprecated functions

Loïc Hoguin-3
In reply to this post by Fred Hebert-2
On 11/23/2017 03:45 AM, Fred Hebert wrote:

> But overall, you can't escape Unicode. Work with JSON? There it is.
> HTTP? There again. URLs? You bet. File systems? Hell yes! Erlang modules
> and app files: also yes!

Nitpick but...

HTTP itself is ASCII. There used to be support for characters in the
128..255 range but they are now obsolete.

URLs with Unicode must be encoded. Headers must be ASCII, if you need
Unicode there's https://tools.ietf.org/html/rfc8187 to the rescue, again
via urlencoding.
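
As an illustration of how Unicode text ends up as plain ASCII in a URL, here is a minimal percent-encoding sketch. It assumes the usual convention of encoding the text as UTF-8 first and leaving only the RFC 3986 unreserved characters untouched; the module name pct_sketch is hypothetical and this is not an OTP API.

```erlang
-module(pct_sketch).
-export([encode/1]).

%% Convert the input to UTF-8 bytes, then percent-encode every byte
%% that is not an RFC 3986 "unreserved" character.
encode(Chars) ->
    Bin = unicode:characters_to_binary(Chars),
    lists:flatten([enc(B) || <<B>> <= Bin]).

%% Unreserved bytes are passed through unchanged.
enc(B) when B >= $a, B =< $z;
            B >= $A, B =< $Z;
            B >= $0, B =< $9;
            B =:= $- ; B =:= $. ; B =:= $_ ; B =:= $~ ->
    B;
%% Everything else becomes %XX with two upper-case hex digits.
enc(B) ->
    io_lib:format("%~2.16.0B", [B]).
```

For example, the code point 233 (é) becomes the two UTF-8 bytes C3 A9, so encode([$c, $a, $f, 233]) yields "caf%C3%A9", which is pure ASCII on the wire.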

My point is, you can go very far with HTTP without ever bothering with
Unicode. And it seems like this will be true in the foreseeable future,
as the upcoming httptre is mostly clarifications to the current httpbis
specifications in preparation for QUIC.

I suspect the same is true for most of the older protocols. And I would
say that's a good thing as it considerably reduces the complexity of the
protocol.

Of course applications themselves are another matter entirely, and if
you deal with anything you haven't written yourself, chances are you
will need to deal with Unicode.

--
Loïc Hoguin
https://ninenines.eu

Re: Strings - deprecated functions

Anthony Ramine-4
In reply to this post by Lloyd R. Prentice-2

> On 22 Nov 2017, at 20:43, [hidden email] wrote:
>
> Dear Gods of Erlang,
>  
> "This module has been reworked in Erlang/OTP 20 to handle unicode:chardata() and operate on grapheme clusters. The old functions that only work on Latin-1 lists as input are still available but should not be used. They will be deprecated in Erlang/OTP 21."
>  
> I'm sorry. I've brought up this issue before and got lots of push back.
>  
> But every time I look up tried and true and long-used string functions to find that they are deprecated and will be dropped in future Erlang releases my blood pressure soars. Both my wife and my doctor tell me that at my age this is a dangerous thing.
>  
> I do understand the importance and necessity of Unicode. And applaud the addition of Unicode functions.
>  
> But the deprecated string functions have a long history. The English language and Latin-1 characters are widely used around the world.

You do know that Latin-1 cannot be used to represent all English words, right?

> Yes, it should be easy for programmers to translate code from one user language to another. But I'm not convinced that the Gods of Erlang have found the optimal solution by dropping all Latin-1 string functions.
>  
> My particular application is directed toward English speakers. So, until further notice, I have no use for Unicode.

Damn, I hope your users will never want to tell their friends how delicious the hors-d'œuvre they ate yesterday was.

> I don't want to sound like nationalist pig, but I think dropping the Latin-1 string functions from future Erlang releases is a BIG mistake.
>  
> I look up tokens/2, a function that I use fairly frequently, and I see that it's deprecated. I look up the suggested replacement and I see lexemes/2.
>  
> So I ask, what the ... is a lexeme? I look it up in Merriam-Webster and I see that a lexeme is  "a meaningful linguistic unit."
>  
> Meaning what? I just want to turn "this and that" into "This And That."
>  
> I read further in the Erlang docs and I see "grapheme cluster."  WHAT THE ... IS GRAPHEME CLUSTER?
>  
> I look up "grapheme" in Merriam-Webster. Oh it is now all so clear: "a unit of a writing system."
>  
> Ah yes, grapheme is defined in the docs. But I have to read and re-read the definition to understand what the God's of Erlang mean by a "graphene cluster." And I'm still not sure I get it.
>  
> It sounds like someone took a linguistics class and is trying to show off.
>  
> But now I've spent 30 minutes--- time that I don't have to waste trying to figure out how do a simple manipulation of "this and that." Recurse the next time I want to look up a string function in the Erlang docs.

IMO the functions should have been named according to "grapheme cluster", not "lexeme".
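
For what it's worth, the "this and that" example from the quoted message can be sketched with the new API as follows. This assumes OTP 20 or later; the module name titlecase_sketch and its function names are invented for the illustration.

```erlang
-module(titlecase_sketch).
-export([title_words/1, lengths/1]).

%% Split on spaces, titlecase the first grapheme cluster of each
%% word, and glue the words back together with single spaces.
title_words(String) ->
    Words = string:lexemes(String, " "),
    lists:append(lists:join(" ", [string:titlecase(W) || W <- Words])).

%% Compare the number of code points (length/1) with the number of
%% grapheme clusters (string:length/1) in the same string.
lengths(String) ->
    {length(String), string:length(String)}.
```

title_words("this and that") returns "This And That". The second function shows what a grapheme cluster is in practice: the list [$e, 769] is "e" followed by a combining acute accent, which is two code points but one user-perceived character, so lengths([$e, 769]) returns {2,1}. Note that lexemes/2 drops runs of separators, so repeated spaces in the input are collapsed to one.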

> SOLUTION
>  
> Keep the Latin-1 string functions. Put them in a separate library if necessary. Or put the new Unicode functions in a separate library. But don't arbitrarily drop them.
>  
> Some folks have suggested that I maintain my own library of the deprecated Latin1 functions. But why should I have to do that? How does that help other folks with the same issue?

The issue is that you want to keep using Latin-1 (which Latin-1 btw, you do know there are at least 2 of them? Do you know which one Erlang uses? Beware that's a tricky question) instead of switching to Unicode, which will benefit even your English users.

> Bottom line: please please please do not drop the existing Latin-1 string functions.
>  
> Please don't.
>  
> Best wishes,
>  
> LRP

Re: Strings - deprecated functions

Joe Armstrong-2
In reply to this post by Lloyd R. Prentice-2
I agree 100%

If you make major changes to a library module that has been around for a very long
time you will break a lot of old code.

At a first cast

   It's OK to add new functions to the module
   It's OK to fix bugs in old functions without changing the names

But it is NOT OK to remove functions or change the semantics of existing
non-buggy functions.

It's not as if we'll run out of module names soon. Call the new module

   strings_v1, strings_1, better_strings, strings_improved or
anything you feel like but

please^1000 don't break my old code.

If I'm just reading code and I see strings:somefunc(...) then I'd very
much like to know
that there is only ONE version of the strings module.

If two things have the same name they should be the same.

One of the really really good things about Erlang is that I can take
20 year old code and
recompile it and it just works with a very high probability.

Please don't break my old code by changing the standard libraries.

Just as an aside - in my hobby projects all my library code is in one directory
(not multiple directories - this makes it easy to decide which
directory to put things in)

Libraries have names like mod.erl, thereafter mod_vsn1.erl,
mod_vsn2.erl, mod_vsn3.erl etc.,
and I don't make big changes to an old module once it is written.

If by any mechanism whatsoever you can create a situation where two
things with the same
name are significantly different you'll end up with a large number of problems.

Really modules should have no names but be named by (say) the SHA1 of
their content
and that way there would be no naming errors - but we don't really
know how to do this in a convenient
way yet.

Cheers

/Joe

Re: Strings - deprecated functions

Joe Armstrong-2
In reply to this post by Loïc Hoguin-3
On Wed, Nov 22, 2017 at 9:28 PM, Loïc Hoguin <[hidden email]> wrote:
> Calm down. Considering how ubiquitous the string module is, these functions
> are not going to be removed for at least a few years. That gives you plenty
> of time to understand the new string module.

If you change it in 1000 years' time you're really going to confuse everybody.

Programs in 3020 will want to know in which millennium the code was written.

There is no shortage of names.

Call the new module string_vsn1 and NOT string then string and string_vsn1
can co-exist *forever*

/Joe




>
> Perhaps during this journey you can help make the documentation for the
> module more user friendly by sending patches or opening tickets at
> bugs.erlang.org. I'll admit that the current documentation does confuse me
> personally, though I've not needed to use it yet.
>
> Unfortunately languages are complex and Unicode is therefore also complex.
> There's no real way around that. Even if you target English speakers it's
> likely that you will need Unicode, because many things require it like names
> or addresses for example. So even if it feels like you won't need it (and
> maybe you won't) it's a good idea to be ready for it.
>
> I wouldn't say latin1 is widely used anymore. Most of everything uses
> Unicode nowadays. Nearly everything switched to Unicode, Erlang is one of
> the last. Even your email was sent encoded in utf-8.

Re: Strings - deprecated functions

Michał Muskała
In reply to this post by Joe Armstrong-2

On 23 Nov 2017, 17:35 +0100, Joe Armstrong <[hidden email]>, wrote:
> I agree 100%
>
> If you make major changes to a library module that has been around for a very long
> time you will break a lot of old code.
>
> At a first cast
>
> It's OK to add new functions to the module
> It's OK to fix bugs in old functions without changing the names
>
> But it is NOT OK to remove functions or change the semantics of existing
> non-buggy functions.
>
> It's not as if we'll run out of module names soon. Call the new module
>
> strings_v1, strings_1, better_strings, strings_improved or
> anything you feel like but


While this sounds great, I will argue that it's not very practical. The primary problem is that somebody now has to maintain both versions of the code. And there are situations when even old code needs to change - the particular case we probably all agree on is when security issues are discovered. If the team (which we assume is of fixed size) spends their time maintaining old code, they don't spend time developing new features. Resources are unfortunately limited.

Another downside of keeping all old implementations is that it decreases readability of code. Code is read much more often than written and should be optimised for reading. But now, each time I see strings:tokens/2 and strings_v1:tokens/2 I need to decide whether they actually do the same thing. And I need to do this every time I read the code. The same distinction is needed during an upgrade, but there it's needed only once.

Michał.


Re: Strings - deprecated functions

Joe Armstrong-2
On Thu, Nov 23, 2017 at 9:39 PM, Michał Muskała <[hidden email]> wrote:

>
> On 23 Nov 2017, 17:35 +0100, Joe Armstrong <[hidden email]>, wrote:
>
> I agree 100%
>
> If you make major changes to a library module that has been around for a
> very long time you will break a lot of old code.
>
> At a first cast
>
> It's OK to add new functions to the module
> It's OK to fix bugs in old functions without changing the names
>
> But it is NOT OK to remove functions or change the semantics of existing
> non-buggy functions.
>
> It's not as if we'll run out of module names soon. Call the new module
>
> strings_v1, strings_1, better_strings, strings_improved or
> anything you feel like but
>
>
> While this sounds great, I will argue that it's not very practical. The
> primary problem is that somebody now has to maintain both versions of the
> code.

Why maintain the old version? As soon as strings_v1 exists,
changes can be made there.

> And there are situations when even old code needs to change - the
> particular case we probably all agree on is when security issues are
> discovered.

Possibly - you could always issue a very strong compiler warning:
"security problem in strings, please change to strings_v1"


> If the team (which we assume is of fixed size) spends their time
> maintaining old code, they don't spend time developing new features.
> Resources are unfortunately limited.
>
> Another downside of keeping all old implementations is that it decreases
> readability of code. Code is read much more often than written and should be
> optimised for reading.

It should be but isn't - look at the history of erl_scan the first versions were
very readable - later versions were heavily optimised and far less readable

> But now, each time I see strings:tokens/2 and
> strings_v1:tokens/2 I need to decide if they actually do the same when
> reading the code.

You could always write in strings_v1.erl

    tokens(X, Y) ->
         strings:tokens(X, Y).

and it would be abundantly clear - or use a  parse transform.
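
Spelled out as a complete module, the one-line delegation above could look like this. It is a sketch under a few assumptions: the module name string_vsn1 is borrowed from the naming suggested elsewhere in this thread and does not exist in OTP, the standard library module is called string (singular), and the old tokens/2 is taken to behave like lexemes/2 for flat Latin-1 strings, as stated in the migration notes earlier in the thread.

```erlang
-module(string_vsn1).
-export([tokens/2, lexemes/2]).

%% The old name is kept as a thin alias, so a reader can see at a
%% glance that both names run exactly the same code.
tokens(String, Separators) ->
    lexemes(String, Separators).

%% The new name forwards to the OTP 20+ implementation.
lexemes(String, Separators) ->
    string:lexemes(String, Separators).
```

With this in place, string_vsn1:tokens("this and that", " ") and string_vsn1:lexemes("this and that", " ") both return ["this","and","that"], and the wrapper is the only file that has to change if the underlying function is ever renamed again.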


> And I need to do this every time I read the code. The same
> distinction is needed during an upgrade, but it's needed only once.

I still maintain that things with the same name must be the same.
As soon as you get two versions of strings offering different
functions then the name of the module 'strings.erl' becomes
ambiguous.

You have to say "I mean the version of strings in version 19.2 of Erlang."
"Oh dear, I thought you meant version 45.3."

If we use a name we should not have to qualify it by either the date
when the name was valid or by the checksum of the Git commit in which
it can be found.

Imagine what would happen if I could change my name on an arbitrary date.

"I was talking to Joe the other day." "When?" "12 June 2015." "Oh, you mean
when he was called Fred." "No, that was later; he changed his name to
Donald on 23rd August 2016."

And what is wrong with names like string_vsn1, string_vsn2 etc.? It's
not as if the integers are going to run out.


One thing I've always hated about revision control systems like Git is
that the same name means different things in different commits. This
causes no end of confusion and many errors.

Breaking people's code by changing the libraries I view as a
fundamental sin.

After a few iterations you'll end up with two mutually incompatible
versions of a library with the same name. One will export a, which you
want to use; the other will export b, which you also want to use.
But you cannot use both.

I have seen this in virtually every system I've ever programmed.

Just invent a new name if you change things.

How difficult can that be?

/Joe


>
> Michał.

Re: Strings - deprecated functions

PAILLEAU Eric
Hi,

If Erlang had namespaces, the old string functions could stay available without a namespace, for compatibility with old code, while new code could use the same function names under a namespace. Let's say -namespace(strings, pref1, "http://www.erlang.org/strings/utf8"). in the header.

When mixing functions from different namespaces, a prefix would resolve the ambiguity:

pref1::strings:func() for the new utf8 module, or strings:func() for the legacy latin1 ones.

This would also solve the very complicated problem of module name clashes from different repositories...

Regards 

