String type

String type

Sam Overdorf
Has anyone considered making string a type and not a list of chars?

I seem to have a lot of trouble when a list is a bunch of string
objects and I start taking it apart with [H|T] = List.

When processing the last string in the list I end up taking apart the
individual characters of the string. If I do a type-check it tells me
it is a list.

I usually have to do a workaround to handle this. If it was a type I
would easily know when I am done with the list.
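
A minimal sketch of the situation described above. The module name string_or_list and the function classify/1 are made up for illustration; is_list/1 and io_lib:printable_list/1 are standard. It shows that a double-quoted literal is an ordinary list of integers, and one heuristic way to tell a flat string from a list of strings:

```erlang
-module(string_or_list).
-export([demo/0, classify/1]).

%% A double-quoted literal is just a list of integer code points,
%% so is_list/1 cannot tell a string from a list of strings.
demo() ->
    true = is_list("abc"),
    true = is_list(["abc", "def"]),
    true = ("abc" =:= [97, 98, 99]),
    ok.

%% Heuristic only: io_lib:printable_list/1 returns true for a flat
%% list of printable characters, and [] also counts as a string.
classify(Term) when is_list(Term) ->
    case io_lib:printable_list(Term) of
        true  -> string;
        false -> list
    end;
classify(_) ->
    other.
```

Here classify("hello") gives string and classify(["hello", "world"]) gives list, but classify([]) also gives string, so the check cannot separate an empty string from an empty list. What counts as printable depends on how the Erlang VM was started, so this is a workaround rather than a type test.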

Thanks,
Sam
[hidden email]
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: String type

Fred Hebert-2
On 06/22, Sam Overdorf wrote:
>Has anyone considered making string a type and not a list of chars.
>

Yes, please see for example this discussion from earlier this month:
http://erlang.org/pipermail/erlang-questions/2018-June/095572.html

>I seem to have a lot of trouble when a list is a bunch of string
>objects and I start taking it apart with [H|T] = List..
>
> When processing the last string in the list I end up taking apart the
>individual characters of the string. If I do a type-check it tells me
>it is a list.
>

Use string:to_graphemes(S) and you'll have a list of Unicode-aware
characters that can be iterated over, in an encoding-neutral format:

1> string:to_graphemes("ß↑e̊").
[223,8593,[101,778]]
2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
[223,8593,[101,778]]
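
As a hedged illustration of how the result above can be used, the sketch below compares the code-point count with the grapheme-cluster count and walks the clusters one at a time. The module name grapheme_demo and its two functions are hypothetical; string:length/1, string:to_graphemes/1 and lists:foreach/2 are from the standard library (the string functions assume OTP 20 or later):

```erlang
-module(grapheme_demo).
-export([count/1, each_grapheme/2]).

%% Returns {CodePoints, GraphemeClusters} for a flat code-point list.
%% length/1 counts list elements; string:length/1 counts clusters.
count(Str) when is_list(Str) ->
    {length(Str), string:length(Str)}.

%% Applies Fun to every grapheme cluster, left to right.
%% A cluster is either a single code point or a list of code points.
each_grapheme(Fun, Str) ->
    lists:foreach(Fun, string:to_graphemes(Str)).
```

For the string used in the shell session, written as code points, grapheme_demo:count([223, 8593, 101, 778]) returns {4, 3}: four code points, three user-visible characters.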

Regards,
Fred.

Re: String type

Kostis Sagonas-2
On 06/22/2018 11:41 PM, Sam Overdorf wrote:

> Has anyone considered making string a type and not a list of chars.
>
> I seem to have a lot of trouble when a list is a bunch of string
> objects and I start taking it apart with [H|T] = List..
>
>   When processing the last string in the list I end up taking apart the
> individual characters of the string. If I do a type-check it tells me
> it is a list.
>
> I usually have to do a work around to handle this. If it was a type I
> would easily know when I am done with the list.

The various pros and cons of having a special string type in Erlang have
been discussed before on this mailing list; please refer to the list
archives to find them.

The arguments for a string data type are often more involved than yours
and typically are centered around space concerns.  If your only grief
with strings-as-lists is the above "programming confusion", may I
suggest that you wrap your strings inside a tuple-pair with a 'str' tag?

I.e., instead of a list of strings ["hello", "world"] you process lists
of the form [{str,"hello"}, {str,"world"}].
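
A rough sketch of that suggestion, assuming a hypothetical module tagged_strings with helper names tag/1 and lengths/1 chosen only for this example:

```erlang
-module(tagged_strings).
-export([tag/1, lengths/1]).

%% Wrap each string in a {str, S} pair.
tag(Strings) ->
    [{str, S} || S <- Strings].

%% Walk the outer list only. The {str, _} pattern can never match a
%% bare character, so recursing into a string by mistake fails with
%% a function_clause error instead of silently splitting the string.
lengths([]) ->
    [];
lengths([{str, S} | Rest]) ->
    [length(S) | lengths(Rest)].
```

With this, tagged_strings:lengths(tagged_strings:tag(["hello", "world"])) returns [5, 5], while calling lengths/1 on an untagged string such as "hello" crashes early, which is the point of the tag.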

When you become more comfortable with list processing in general, you
can then drop the 'str' tuple wrappers from your code, and most probably
you will also realize that the list(string()) representation is not so
bad after all.

Kostis

Re: String type

Sam Overdorf
Someone gave me a link to a previous discussion of this and its problems.
I read it and decided that I need to change my process and not modify Erlang.

Thanks for the response,
Sam


On Mon, Jun 25, 2018 at 7:30 AM, Richard O'Keefe <[hidden email]> wrote:

> Yes, people have often considered adding a "real" string
> data type to Erlang.  With the move to 64-bit machines
> this became even more interesting.  However, in a Unicode
> world, it really is not clear what string *is*.
>
> For example, in the old ASCII days, it was clear that a
> string was a sequence of characters, and all characters
> were the same size, and the actual ASCII definition made
> it clear that NUL and DEL probably should NOT be allowed
> in a string.  US programmers (hence US textbooks, hence
> practically everyone in the English-speaking world)
> quietly ignored the fact that the ASCII standard explicitly
> allowed overstrikes so that you could get u-with-umlaut
> by doing <u> <BS> <"> or even <"> <BS> <u>.  So in fact in
> ASCII a "character" could well be a sequence of code-points
> and that is in fact why ` and ^ are in the ASCII set, and
> it wasn't therefore *true* that all characters were the
> same size.
>
> In the ISO 8859 family, the standardisers bowed to reality
> and forbade overstriking, introducing precomposed accented
> letters instead.  So the statement that ASCII is a subset
> of ISO 8859/1 is a half truth: the codepoints are a subset
> but ASCII allows you to DO things with them that Latin-1
> does not.
>
> Unicode has it both ways.  It has precomposed characters
> like u-with-umlaut, and it also has composed characters
> like u-followed-by-(floating umlaut).  Which means we
> now have to ask "is a string a sequence of codepoints
> or a sequence of characters". But it's more complicated.
> See Unicode Technical Annex 29 "Unicode Text Segmentation"
> for the horrible details.  But the alternatives are
>
> - sequence of bytes (in UTF8)
> - sequence of 16-bit units (UTF16)
> - sequence of code-points
> + sequence of legacy grapheme clusters
> + sequence of extended grapheme clusters
> + sequence of tailored grapheme clusters
> bearing in mind that
> * some code points are always illegal
> * most code points are unassigned
> * some sequences of code points are illegal
> * in particular, legal sequences may have
>   illegal subsequences, so the "substring"
>   operation is problematic.
>
> Let's not even try to think about the existence
> of multiple characters with identical appearance,
> multiple ways to encode many characters,
> invisible characters, characters forbidden by design
> then introduced then deprecated, and the question
> of whether control marks like redundant direction
> indicators should count in deciding whether strings
> are equal.
>
> If you are dealing with text where you are actually
> looking at the characters doing some sort of parsing,
> the chances are you want a list of tokens or even
> some sort of tree rather than a string.
>
> I'm actually more interested in the fact that you say
> you have trouble with lists of strings.  Can you
> provide an example of the kind of code you have
> trouble with?  If you use the Dialyzer, it has no
> trouble expressing the difference between a list of
> integers and a list of lists of integers, and even
> without it, it's not a commonly reported problem.
>
> For example, suppose we have a list of strings and
> want to paste them together with spaces between
> them.  This is called "unwords" in Haskell.  Let's
> start with the Haskell version.
>
> unwords :: [String] -> String
> unwords [] = []
> unwords (w:ws) = w ++ aux ws
>   where aux [] = []
>         aux (y:ys) = " " ++ y ++ aux ys
>
> Let's put that into Erlang:
>
> unwords([]) -> [];
> unwords([W|Ws]) -> W ++ unwords_aux(Ws).
>
> unwords_aux([]) -> "";
> unwords_aux([Y|Ys]) -> " " ++ Y ++ unwords_aux(Ys).
>
> By the way, this kind of thing is spectacularly
> inefficient in languages like Java, which is why Java
> has StringBuilder as well as String.  This is one of
> many reasons why I have a slogan STRINGS ARE WRONG.
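
The Erlang unwords definition quoted above can be tried as a small module. This is only a sketch: the module name unwords_demo and the -spec lines are additions, not part of the quoted message. The specs show how the difference between a list of strings and a single string can be written down for Dialyzer:

```erlang
-module(unwords_demo).
-export([unwords/1]).

%% [string()] is a list of lists of characters;
%% string() is a single flat list of characters.
-spec unwords([string()]) -> string().
unwords([]) -> [];
unwords([W|Ws]) -> W ++ unwords_aux(Ws).

%% Prefixes every remaining word with a single space.
-spec unwords_aux([string()]) -> string().
unwords_aux([]) -> "";
unwords_aux([Y|Ys]) -> " " ++ Y ++ unwords_aux(Ys).
```

unwords_demo:unwords(["hello", "world"]) returns "hello world". On OTP 19 or later, lists:append(lists:join(" ", Words)) should give the same result using only library functions.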

Re: String type

Lloyd R. Prentice-2
Hi Richard,

I missed this post when it popped up the first time around. But, as usual, it explains much with great clarity.

But it still leaves me with profound frustration. At this point I realize that my frustration in part has to do with the highly technical function names in the new and improved string library-- names such as lexemes/2, next_codepoint/1, next_grapheme/1, etc.

I do understand that these names are quite precise relative to the concepts they're addressing.

But that's the problem for me. When I'm programming in my native language, English, I simply don't think in terms of these concepts. Yes, I could spend a day and build the conceptual bridges between my deprecated ascii-think and the Unicode way of thinking then burn the bridges behind me.

But since I don't imagine manipulating Urdu, Chinese, or Swedish text any time soon, as much as I'd love to be fluent in any one of these languages and others, the time spent building and burning bridges feels like an unproductive investment.

The technical world is going to Unicode and for good reasons. I get that.

But one thing might help clear the fog enormously:

A tutorial that explicitly maps the concepts of the deprecated strings to their replacements.

The current string reference takes a stab at this. But I find it quite opaque.

I know that you're a busy guy. But it seems that you have the skills to clear the fog.

Think you could squeeze out an hour or two to help an old man and others of my ilk move into the bright and shiny future?

All the best,

LRP
 

 





