string:lexeme/s2 - an old man's rant

string:lexeme/s2 - an old man's rant

Lloyd R. Prentice-2
Hi,

This has come up before with various work-arounds suggested. Apologies for this old-man's rant, but every time I run across the impending death of string:tokens/2 to the glory of string:lexemes/2 my blood pressure rises.

I HATE IT. I HATE IT. I HATE IT, not least because the terms lexeme and grapheme are ugly inside-baseball words. Reading the docs, I have to do a Google search to understand what these obscure terms are referring to-- precious time wasted. And with my waning years, I don't have time to waste.

Even my spell-checker doesn't recognize them.

I get the desirability of welcoming Unicode into Erlang. But can't we come up with friendlier nomenclature or, at least, revise the docs so they don't sound like a copy-and-paste out of an academic linguistics journal?

Grrr.

LRP


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: string:lexeme/s2 - an old man's rant

Sam Overdorf
Remember "Occam's razor".
I like simple and easy to use myself.

Sam





Re: string:lexeme/s2 - an old man's rant

Hugo Mills-2
In reply to this post by Lloyd R. Prentice-2
On Mon, May 06, 2019 at 05:40:48PM -0400, [hidden email] wrote:
> I get the desirability of welcoming Unicode into Erlang. But can't we come up with friendlier nomenclature or, at least, revise the docs so they don't sound like a copy-and-paste out of an academic linguistics journal?


   Most of the other words you might want to use are already in use
for other things. Modern (computer) representation of writing systems
is complicated, and there aren't enough words to go round the existing
concepts. Particularly words without well-known and either misleading
or overly-narrow definitions -- see my comment on "letters", below.

   For the two particular words you're complaining of here, I think of
them thus:

   graphemes, like graphology(*), are to do with the way that
   something's written on the page -- the shape and composition of the
   symbols. It's essentially a letter plus all of its diacritics (but
   it's not defined as such, because there are some graphemes that are
   ligatures of two or more letters, and some languages where each
   grapheme is a word in its own right).

   lexemes, like a lexicon, are to do with words, and are therefore
   groups of (certain kinds of) graphemes.

   Hugo.

(*) For all that it's unsubstantiated in its psychometric claims.



Re: string:lexeme/s2 - an old man's rant

Richard O'Keefe
Words ending with the morpheme "-eme" generally come from linguistics.
In particular, "grapheme" can only be defined with respect to a
particular writing system.  "In linguistics, a grapheme is the smallest unit of a writing system of any given language."  For example, in English,
"ë" is two graphemes, an "e" grapheme, and a "pronounce this vowel
separately" grapheme.  In other European languages, "e" and "ë" are
quite separate letters.

What Hugo Mills described is not a grapheme but a grapheme *cluster*.

We have code unit, code point, glyph, character, grapheme, grapheme
cluster, and a bunch of other terms that are pretty much identical
in ASCII or ISO 8859 but when you make a serious attempt to encode
all the scripts anyone wants to use on a computer, things get
horribly complicated.  And they get complicated in *language-specific*
ways.  (Like case conversion.  You can't really do case conversion in
Unicode without knowing what language you are concerned with.)  This
always *was* complicated in the real world, but people in Western
Europe and the Americas were mostly able to ignore it.  (Things got
somewhat complicated in NZ where the indigenous language uses a
Latin-based script with macrons and where wh and ng count as single
letters.)
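
To make three of those terms concrete, here is a minimal sketch that counts the same text as UTF-8 code units (bytes), as code points, and as grapheme clusters. The module name `text_units` and the function `counts/1` are made up for illustration; `unicode:characters_to_binary/1` and `string:length/1` are the standard library calls, and the input is assumed to be a flat list of code points.

```erlang
-module(text_units).
-export([counts/1]).

%% Count one piece of text three ways: UTF-8 code units (bytes),
%% Unicode code points, and grapheme clusters.
%% CodePoints is assumed to be a flat list of code points.
counts(CodePoints) ->
    Utf8 = unicode:characters_to_binary(CodePoints),
    #{code_units => byte_size(Utf8),
      code_points => length(CodePoints),
      grapheme_clusters => string:length(CodePoints)}.
```

For plain ASCII such as `"abc"` all three counts agree. For an "e" followed by a combining diaeresis, written `[$e, 16#0308]`, the sketch should report three code units, two code points and one grapheme cluster.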

Curiously, in the Unicode 12 standard, "grapheme" is not in the index,
but "grapheme base", "grapheme cluster", and "grapheme extender", for
example, are.

I suspect that the word "grapheme", precisely because it is a
language-dependent technical term with some surprising twists,
may not be a good word to use here.

"Lexeme" is, if anything, worse. "A lexeme is a unit of lexical meaning that
underlies a set of words that are related through inflection. It is a basic
abstract unit of meaning, a unit of morphological analysis in linguistics that
roughly corresponds to a set of forms taken by a single root word."  That is
NOT what it means here.  In computing, it basically means "token".  But what
*does* it mean?  In "Now we see it, now we don't." are there two lexemes
spelled "we" or is there one "lexeme" with two occurrences?  (If you ever
meet two linguists in a bar who don't know each other, try asking them what
a "word" is.  There are at least four different meanings.)

"token" has the merit of coming from one half of the type/token distinction.
In fact, that's *WHY* they are called tokens.  In "Now we see it, now we
don't" there is ONE word type "we" which has TWO tokens.
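
The type/token distinction can be sketched in a few lines. This is only an illustration: the module name `type_token` is invented, the separator set (space, comma, full stop) is an assumption that happens to suit the example sentence, and word forms are compared case-sensitively, so "Now" and "now" count as different types.

```erlang
-module(type_token).
-export([tokens/1, types/1, occurrences/2]).

%% Every word occurrence ("token"), in order of appearance.
%% Separators are assumed to be space, comma and full stop.
tokens(Text) ->
    string:lexemes(Text, " ,.").

%% The distinct word forms ("types"); comparison is case-sensitive.
types(Text) ->
    lists:usort(tokens(Text)).

%% How many tokens of one given type occur in Text.
occurrences(Type, Text) ->
    length([T || T <- tokens(Text), T =:= Type]).
```

For "Now we see it, now we don't." this gives seven tokens, six types, and two tokens of the type "we".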

So seriously, as someone who has been reading academic linguistics for
several decades and has spent more time trying to understand Unicode than
is compatible with sanity, I think the OP's objection carries weight.
(I said I've been *reading* the stuff.  That's not always the same as
*understanding* it, and I certainly couldn't *write* like a linguist.)



Re: string:lexeme/s2 - an old man's rant

Lloyd R. Prentice-2
Hi Folks,

I’m certainly not smart enough to resolve this issue. But it seems somewhat like the tension between writing software that solves every imaginable problem in the domain and software that solves the immediate problem at hand.  We know in the first case that mounting complexity quickly gets out of hand.

If I expect users from every one of the 7,000-some-odd living languages to use my web app, then no doubt Unicode beats ASCII hands down. But how should I handle prompts? A 7,000-option case statement maybe? And where do I find the translators?

Aside from that, I have no quarrel with Unicode. I’m grateful that it enables my programs to respect many language conventions. But I would much prefer keeping string:tokens/2 in the Erlang string library and renaming string:lexemes/2 to something like string:unicode_tokens/2. If nothing else, this would take a considerable burden off the documentation.
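
Nothing stops application code from adopting the friendlier name locally. A minimal sketch, assuming a hypothetical module `friendly_string` (no such module or function ships with OTP); it only forwards to `string:lexemes/2`:

```erlang
-module(friendly_string).
-export([unicode_tokens/2]).

%% Hypothetical alias, not part of OTP: it gives string:lexemes/2
%% the name suggested in this thread and changes nothing else.
unicode_tokens(String, Separators) ->
    string:lexemes(String, Separators).
```

Callers would then write `friendly_string:unicode_tokens("04/07/19", "/")`. The cost is one more indirection that every reader of the code has to learn.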

But what do I know?

All the best,

Lloyd


Re: string:lexeme/s2 - an old man's rant

empro2
On Tue, 7 May 2019 12:04:46 -0400
"Lloyd R. Prentice" <[hidden email]> wrote:

> language conventions. But I would much prefer keeping
> string:tokens/2 in the Erlang string library and renaming
> string:lexemes/2 to something like
> string:unicode_tokens/2. If nothing else, this would take
> a considerable burden off the documentation.

Both names force meaning onto mere substrings plucked from
some string argument chopped to pieces at some separator
characters (with `tokens`) or substrings (`lexemes`).

The author cannot know what the resulting substrings mean to
the user, may be tokens, may be lexemes, may simply be
substrings for whatever use substrings might be useful
for; the "key=value" strings from a query-string chopped at
"&" are neither tokens nor lexemes.

I would provide an option to return empty substrings (for
counting) or not, instead of imperative `split` and
foisting `token` and `lexeme`.

This is a good example of things I have been collecting
about the documentation (so far I have been brainstorming
in my chamber without you):

        "two or more adjacent separator graphemes clusters
        in String are treated as one."

No-one cares: users looking up the spec want to know
whether they get empty substrings or not -- and how.
With such an option to one `substring` function they
would know immediately; as things are, one needs to guess
from `split` to `token` to `lexeme` and back. Qizzy! Or they
employ a text search for "substring" to end up where I
would have begun ... (I hope :-)
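
A rough sketch of what such a single entry point might look like. The module `substr_demo`, the function `substrings/3` and the option atoms `keep_empty` and `drop_empty` are all invented for illustration; only `string:split/3` and `string:is_empty/1` are real library calls.

```erlang
-module(substr_demo).
-export([substrings/3]).

%% Hypothetical helper, not an OTP function: one entry point with an
%% explicit choice about empty substrings.
%%   keep_empty - split at every separator, keeping empty pieces
%%   drop_empty - same split, with the empty pieces filtered out
substrings(String, Separator, keep_empty) ->
    string:split(String, Separator, all);
substrings(String, Separator, drop_empty) ->
    [S || S <- string:split(String, Separator, all),
          not string:is_empty(S)].
```

With the query string `"a=1&b=2&&c=3"` and separator `"&"`, `keep_empty` yields four pieces, one of them empty, while `drop_empty` yields the three "key=value" strings.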

Moreover: why at all treat two adjacent separators
specially? And more thinking takes the users further away
from whatever they were trying to accomplish ...


        "Notice that [$\r,$\n] is one grapheme cluster."

as is any character list = string? This note drives me to
confusion, requires me to step one meta-layer further away
from whatever I was trying to implement or design. Did I
misunderstand something? Up until here I thought
`Newline_separators = [[$\n], [$\r], [$\r, $\n]]`, but
now, why mention the obvious ...?
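
One possible reading of that note, sketched below: `string:lexemes/2` compares whole grapheme clusters against the separator list, so listing `$\r` and `$\n` as two separate characters does not match a CR+LF pair, whereas `string:tokens/2` compares single characters. The module and function names are made up for the demonstration.

```erlang
-module(newline_demo).
-export([chars_as_separators/1, cluster_as_separator/1, old_tokens/1]).

%% $\r and $\n given as two separate single-character separators.
chars_as_separators(String) ->
    string:lexemes(String, [$\r, $\n]).

%% [$\r,$\n] given as one separator, alongside a lone $\n.
cluster_as_separator(String) ->
    string:lexemes(String, [[$\r, $\n], $\n]).

%% The older function compares individual characters.
old_tokens(String) ->
    string:tokens(String, [$\r, $\n]).
```

For the input `"a\r\nb\nc"`, the first function should leave the CR+LF in place and return `["a\r\nb","c"]`, while the other two return `["a","b","c"]`. That difference appears to be what the note in the reference is warning about.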


        "Where, default leading, indicates whether the
        leading, the trailing or all encounters of
        SearchPattern will split String."

Leading and trailing separators do not really separate, the
example (I love examples at specs :-) shows that more
probably "first" and "last" are meant. Now who would want
to dig up the documentation out of the repo, change, index,
commit, push (or whatever), set up a pull request and ...
try to remember what they were doing? Reference, with
examples (and possibly some *distinct* implementation
rationale, if that is the right place) and User Guide are
not code (nor code comments, I am starting to grow doubts
about all these JavaDoc, erldoc, ...doc thingies).
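
For comparison, a small sketch of the three `Where` values of `string:split/3` applied to the same input; the module name `split_where` is invented, and the input mirrors the style of the reference manual's own example.

```erlang
-module(split_where).
-export([demo/0]).

%% The same input and search pattern under each of the three
%% Where values accepted by string:split/3.
demo() ->
    String = "ab..bc..cd",
    #{leading  => string:split(String, "..", leading),
      trailing => string:split(String, "..", trailing),
      all      => string:split(String, "..", all)}.
```

`leading` splits only at the first occurrence (`["ab","bc..cd"]`), `trailing` only at the last (`["ab..bc","cd"]`), and `all` at every occurrence (`["ab","bc","cd"]`), which supports the reading that "first" and "last" are what is meant.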

(*Note: "spec" above means 'reference', not the module
attribute, I know I should change them ...*)

Reference and Guide are to be read by people who do not
know what is meant -- code, comments and implementation
details are for those who know too much, it is too much
effort to change one's view from implementing know-it-all
to unknowing user. So some know too little and others too
much. Of course, ex nihilo nihil fit, so the documentation
needs to be prepared by those who know too much and some
wiki could be a good way to get improvements by those who
know too little.

My! such loads of prose to lay down what is some simple
thought in my head ... Sorry! (somehow :-)

Now I can possibly throw away the other draft in which I
have been collecting and trying to arrange many (all?) of
those things mentioned above over the previous
months ... :-)

~Michael


Re: string:lexeme/s2 - an old man's rant

Lloyd R. Prentice-2
Hi Michael,

Your point re users, i.e., documentation consumers, who know too little vs. users who know too much is well taken. Count me in the first bunch.

> Both names force meaning onto mere substrings plucked from
> some string argument chopped to pieces at some separator
> characters (with `tokens`) or substrings (`lexemes`).

If I had a god-like wand I’d do a survey of all instances in all computer languages in which the programmer intends to split natural language text into a list at indices that mark the beginning of some predefined sub-segment of the text where that sub-segment may recur zero to n times.

Yuk! Even trying to describe the problem abstractly gets ugly fast since, as Richard astutely points out, we don’t have terminology we can agree on. Is a “string” a passage of natural language, an array of bytes, an arrangement of bits irrespective of byte boundaries, an Erlang list, or some other entity?

Seems to me that Unicode valiantly wrestles with the problem of mapping analog representations of “meaning” into digital representation. Problem is that there are countless ways of making analog marks on paper, stone, or what have you. Even the universe of marks used by the 7,000+ living languages is a formidable number. When we try to map these into the digital realm we either shamefully waste memory resources by giving them all equal length, or we’re forced to deal with the nasty problem of determining where one mark ends and the next begins in our digital space.

ASCII solves this problem quite elegantly at the price of excluding much of the world’s population. Unicode is far more inclusive at the price of greater code complexity and muddled discourse re naming of parts. You pay your money and you take your choice.

I’m arguing for choice. Keep the simple ASCII-based string functions in the Erlang string library and either create a separate Unicode library or provide Unicode string functions with more suggestive/evocative names.

All the best,

Lloyd

P.S. Michael, I’m all for clearer documentation with illustrative examples.


Re: string:lexeme/s2 - an old man's rant

Richard O'Keefe
In reply to this post by empro2
For what it's worth, in Unicode, Line Separator and Paragraph
Separator are the recommended characters, with CR, LF, CR+LF,
and arguably NEL (U+0085) being "legacy".

Again for what it's worth, Unicode defines an algorithm for
breaking text into word( token)s.


Re: string:lexeme/s2 - an old man's rant

Lloyd R. Prentice-2
Hi Richard,

Thanks for clarifying the inner workings of Unicode. 

Which makes me wonder: if string:tokens/2 and string:lexemes/2 are functionally identical, or at least substitutable, why not change the implementation of string:tokens/2 to accommodate Unicode, leave the function name alone, and announce to the world that as of Erlang Version XX the implementation of string:tokens/2 has been changed to accommodate Unicode?

Then we don’t have to worry about revising legacy code at some point in the future. Yes, I understand that the legacy code might have to be recompiled under the new version of Erlang in the case Unicode becomes universal. But that seems to me a smaller price than revising source code.

A simple example that I ran into yesterday while proofreading Build It with Nitrogen, the book that Jesse Gumm and I have been working on for far too long now:

We used the function string:tokens/2 moons ago to parse a date string in the form “04/07/19”. string:tokens/2 was in good standing when we wrote the chapter. Had we published the book in a timely fashion, our readers today might think, “Oh, this book is no good. It uses obsolete functions.”

I could have changed the function to string:lexemes/2. But if my mind goes tilt when I look at the documentation, what can I expect of my readers? I ended up changing it to re:split/3.
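
For a plain ASCII date like this one, the three approaches should be interchangeable. A minimal sketch, with an invented module name `date_parts`; note that `re:split/3` returns binaries unless asked for lists, hence the `{return, list}` option, and that newer OTP releases may flag `string:tokens/2` as obsolete.

```erlang
-module(date_parts).
-export([with_tokens/1, with_lexemes/1, with_re/1]).

%% The older function; newer OTP releases may flag it as obsolete.
with_tokens(Date) ->
    string:tokens(Date, "/").

%% The Unicode-aware replacement; "/" is a one-element separator list.
with_lexemes(Date) ->
    string:lexemes(Date, "/").

%% Regular-expression split; {return, list} asks for strings
%% rather than the default binaries.
with_re(Date) ->
    re:split(Date, "/", [{return, list}]).
```

All three return `["04","07","19"]` for the input `"04/07/19"`. They only start to differ on inputs with adjacent or leading separators, where `re:split/3` keeps empty pieces and the other two drop them.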

All the best,

Lloyd

Sent from my iPad

On May 7, 2019, at 6:53 PM, Richard O'Keefe <[hidden email]> wrote:

For what it's worth, in Unicode, Line Separator and Paragraph
Separator are the recommended characters, with CR, LF, CR+LF,
and of arguably NEL (U+0085) being "legacy".

Again for what it's worth, Unicode defines an algorithm for
breaking text into word( token)s.

On Wed, 8 May 2019 at 08:17, <[hidden email]> wrote:
On Tue, 7 May 2019 12:04:46 -0400
"Lloyd R. Prentice" <[hidden email]> wrote:

> language conventions. But I would much prefer keeping
> string:tokens/2 in the Erlang string library and renaming
> string:lexemes/2 to something like
> string:unicode_tokens/2. If nothing else, this would take
> a considerable burden off the documentation.

Both names force meaning onto mere substrings plucked from
some string argument chopped to pieces at some separator
characters (with `tokens`) or substrings (`lexemes`).

The author cannot know what the resulting substrings mean to
the user, may be tokens, may be lexemes, may simply be
substrings for whatever use substrings might be useful
for; the "key=value" strings from a query-string chopped at
"&" are neither tokens nor lexemes.

I would provide an option to return empty substrings (for
counting) or not, instead of imperative `split` and
foisting `token` and `lexeme`.

This is a good example of things I have been collecting
about the documentation (so far I have been brainstorming
in my chamber without you):

        "two or more adjacent separator graphemes clusters
        in String are treated as one."

No-one cares: users looking up the spec want to know
whether they get empty substrings or not -- and how.
With such an option on one `substring` function they
would know immediately; as things are, one needs to guess
from `split` to `token` to `lexeme` and back. Qizzy! Or they
employ a text search for "substring" to end up where I
would have begun ... (I hope :-)

Moreover: why at all treat two adjacent separators
specially? And more thinking takes the users further away
from whatever they were trying to accomplish ...


        "Notice that [$\r,$\n] is one grapheme cluster."

as is any character list = string? This note drives me to
confusion, requires me to step one meta-layer further away
from whatever I was trying to implement or design. Did I
misunderstand something? Up until here I thought
`Newline_separators = [[$\n], [$\r], [$\r, $\n]]`, but
now, why mention the obvious ...?


        "Where, default leading, indicates whether the
        leading, the trailing or all encounters of
        SearchPattern will split String."

Leading and trailing separators do not really separate, the
example (I love examples at specs :-) shows that more
probably "first" and "last" are meant. Now who would want
to dig up the documentation out of the repo, change, index,
commit, push (or whatever), set up a pull request and ...
try to remember what they were doing? Reference, with
examples (and possibly some *distinct* implementation
rationale, if that is the right place) and User Guide are
not code (nor code comments, I am starting to grow doubts
about all these JavaDoc, erldoc, ...doc thingies).

(*Note: "spec" above means 'reference', not the module
attribute, I know I should change them ...*)

Reference and Guide are to be read by people who do not
know what is meant -- code, comments and implementation
details are for those who know too much; it is too much
effort to change one's view from implementing know-it-all
to unknowing user. So some know too little and others too
much. Of course, ex nihilo nihil fit, so the documentation
needs to be prepared by those who know too much and some
wiki could be a good way to get improvements by those who
know too little.

My! such loads of prose to lay down what is some simple
thought in my head ... Sorry! (somehow :-)

Now I can possibly throw away the other draft in which I
have been collecting and trying to arrange many (all?) of
those things mentioned above over the previous
months ... :-)

~Michael

--

Time is not money, but money is time: life-time people have
spent transforming their environment.

Re: string:lexeme/s2 - an old man's rant

Richard O'Keefe
Let's look at the documentation for tokens/2:

http://erlang.org/doc/man/string.html#tokens-2

The first thing I notice is that we are told *that*
the function is obsolete but not *why* it is, and
that's important.

The second thing I notice is that we are told
to use lexemes/2 instead, but we are not told *how*
to do that.  An example showing an old call and its
new equivalent would do wonders.

The third thing I notice is the reason that the
second thing matters.  Consider the following
examples:
  tokens("aaa", "x") => ["aaa"]
  tokens("aa", "x")  => ["aa"]
  tokens("a", "x")   => ["a"]
so by continuity we expect
  tokens("", "x")    => [""]
BUT the result is actually [].  True, the
description says that the result is a list
of non-empty strings, but I don't really see
why that is so important that our natural
expectation that tokens(S, [X]) => [S]
whenever S is *any* string not containing X
should be violated, and if it is, then I
would definitely expect an exception.
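
An old-call/new-call comparison of the kind asked for above might look like the following sketch. The module name is hypothetical; each line is a pattern match that only succeeds if the call returns the value on its left. Note that string:tokens/2 is flagged as obsolete in the docs and may produce a deprecation warning, depending on the OTP release.

```erlang
-module(tokens_vs_lexemes).
-export([demo/0]).

demo() ->
    %% Input without the separator: old and new call agree.
    ["aaa"] = string:tokens("aaa", "x"),
    ["aaa"] = string:lexemes("aaa", "x"),
    %% Empty input: both return [], not [""].
    [] = string:tokens("", "x"),
    [] = string:lexemes("", "x"),
    %% Adjacent separators are coalesced by both.
    ["a", "b"] = string:tokens("a,,b", ","),
    ["a", "b"] = string:lexemes("a,,b", ","),
    %% string:split/3 is the one that keeps the empty piece.
    ["a", "", "b"] = string:split("a,,b", ",", all),
    ok.
```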

The fourth thing I notice is that the treatment
of multi-element separator lists is odd.  I have
had occasion to use separators with more than
one code-point, and for Unicode that could be
essential.  I have also had occasion to use
split at C1, then at C2, then at C3, then at C4, ...
I've also had occasion to split on one separator
and then split the pieces into smaller pieces,
so multiple levels of splitting.  (Think of
/etc/passwd for a simple example.)  But the only
time I ever want multiple *alternative* separators
is when asking for white-space separation, and
*that* is when I want non-empty pieces.  It is
also the only time I ever want separators coalesced.
Given a string like "||x|yy||w" and the separator
"|", I've always wanted ["","","x","yy","","w"]
as the answer.  But there's a particular point
here:  which of us knows off-hand just what all
the Zs, Zl, and Zp characters of Unicode actually
are?  It would make a *lot* of sense to have
   tokens(String) -> list of non-empty pieces
   tokens(String, Sep) -> list of possibly empty
     pieces separated by the non-empty substring Sep.
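
A rough sketch of that pair of functions, built on existing library calls. The names split_sketch, pieces/1 and pieces/2 are invented here, and "white space" is simplified to what the regular expression \s matches by default, which does not cover the Unicode Zs, Zl and Zp classes; input is assumed to be a plain list of Latin-1 characters.

```erlang
-module(split_sketch).
-export([pieces/1, pieces/2]).

%% pieces/1: non-empty pieces separated by runs of white space.
%% re:split/3 yields "" at a leading or trailing separator,
%% so those are filtered out afterwards.
pieces(String) ->
    Parts = re:split(String, "\\s+", [{return, list}]),
    [P || P <- Parts, P =/= ""].

%% pieces/2: possibly empty pieces separated by the non-empty
%% substring Sep. An empty Sep fails with function_clause.
pieces(String, Sep) when Sep =/= "" ->
    string:split(String, Sep, all).
```

With this, split_sketch:pieces("||x|yy||w", "|") gives ["","","x","yy","","w"], and split_sketch:pieces("  a \t b\n") gives ["a","b"].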

The fifth thing I notice is that there is no
specification of what happens if SeparatorList is
empty.

All things considered, this is a function I am never
going to use, because it is less work to write my own
than to try to figure out this documentation.  And I
had to look at the code to figure some of it out.

I get seriously confused by some of the code in
string.erl.  We find
%% Fetch first grapheme cluster ..
next_grapheme(CD) -> ..
Which is it?  Grapheme or grapheme cluster?  These
are *different* (but overlapping) things!  And
where is the locale argument so that the function
knows what a "user-perceived character" actually *is*?
How come an empty list counts as a grapheme_cluster()?

What if I have something like
"foo:bar::uggle::zoosh" and I want to split it at
"::" but NOT at ":"?  "::" is not a grapheme cluster,
so it looks like neither of these functions will help
me.
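
For this particular case a third function in the same module, string:split/3, appears to do the job, since its second argument is a search pattern matched as a whole substring rather than a set of separator characters. A minimal sketch, with a made-up module name:

```erlang
-module(double_colon).
-export([fields/1]).

%% Split at every "::" and leave single ":" characters alone.
fields(String) ->
    string:split(String, "::", all).
```

Here double_colon:fields("foo:bar::uggle::zoosh") returns ["foo:bar","uggle","zoosh"].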

Writing good documentation is HARD.  At dear departed
Quintus, we started with a full time technical writer
and expanded to three, nearly as many as developers.

The *name* 'lexemes' is arguably the *least* confusing
thing in the documentation.  If it were called z3k_u4y/2
that would increase my confusion very little.


Re: string:lexeme/s2 - an old man's rant

Lloyd R. Prentice-2
Hi Richard,

My head spins.

I’m ashamed to say that I’m functionally illiterate in every natural language of the world except English— and I’m still working at mastering that. And I program for English speakers.

So, I guess I’ll stick with re:split/3 until I have a pressing need for Unicode. Maybe by then the issues will be ironed out— and better, well tucked under the hood.

Richard, you’re a star in the Erlang firmament. 

Thank you,

Lloyd

Sent from my iPad


Re: string:lexeme/s2 - an old man's rant

empro2
On Wed, 8 May 2019 00:27:41 -0400
"Lloyd R. Prentice" <[hidden email]> wrote:

> Richard, you’re a star in the Erlang firmament.

Not at all limited to Erlang, not even to all the
languages he has ever mentioned; so it is at least the
programming languages firmament ... oh, and the unicode
firmament, it appears ...

I sometimes wonder whether I should eat more kiwis, sorry!
kiwi fruit, of course ... then again ... ;-)

~Michael

--

If a *bank* in need of money is systematically important,
then that system is not important.

Re: string:lexeme/s2 - an old man's rant

empro2
In reply to this post by Richard O'Keefe
On Wed, 8 May 2019 12:39:55 +1200
"Richard O'Keefe" <[hidden email]> wrote:

> All things considered, this is a function I am never
> going to use, because it is less work to write my own
> than to try to figure out this documentation.  And I
> had to look at the code to figure some of it out.

Glad to hear! With all this much appreciated precious
effort gone and going into Erlang (including its
documentation), I was not sure whether it was merely my own
personal stupidity. Well, it still might be, as it is not
only `tokens/2` that takes me too far away from coding
and designing.


> I get seriously confused by some of the code in
> string.erl.

Who is supposed to believe that?! ;-)
Seriously, that would make me shut down everything and take
extensive walks before reworking everything ...


> Writing good documentation is HARD.  At dear departed
> Quintus, we started with a full time technical writer
> and expanded to three, nearly as many as developers.

"Limited resources" was one of the first items on my list.
The link in this anniversary posting

<http://erlang.org/pipermail/erlang-questions/2017-December/094396.html>

proved that I had correctly guessed all the points
mentioned in it.

My basic intention is to somehow generate so much
community help that the Erlang/OTP team gets so bored that
they go and replace records with frames :-)

"What is to become of my life, dreaming days away ...?"

~Michael

--

“Even after a thousand explanations a fool is no wiser,
whereas someone intelligent requires only one fourth of
these.”

        – from the Mahābhārata (महाभारत)

Re: string:lexeme/s2 - an old man's rant

zxq9-2
In reply to this post by Richard O'Keefe
On Wednesday, 8 May 2019, 10:53:25 JST, Richard O'Keefe wrote:
> For what it's worth, in Unicode, Line Separator and Paragraph
> Separator are the recommended characters, with CR, LF, CR+LF,
> and arguably NEL (U+0085) being "legacy".
>
> Again for what it's worth, Unicode defines an algorithm for
> breaking text into word( token)s.

I don't really mind the term "lexeme", but I've wondered why the
existing tokens/2 function wasn't simply updated to work the way
lexemes/2 works.

If we needed a new function, it seems the name "tokenize/2" might
have been an easier mental adjustment.

But anyway, naming things is hard and... meh. For me the unicode
enhancements are a big enough deal that I could *almost* care less
what they are called.

That said, who isn't going to open a new language's string lib and
expect to find things called "split" "tokenize"/"tokens", "clean",
"right", "left", "pad", etc.?

-Craig

Re: string:lexeme/s2 - an old man's rant

empro2
On Wed, 08 May 2019 23:18:17 +0900
[hidden email] wrote:

> On Wednesday, 8 May 2019, 10:53:25 JST, Richard O'Keefe
> wrote:
> > Again for what it's worth, Unicode defines an algorithm
> > for breaking text into word( token)s.
>
> I don't really mind the term "lexeme",

What if someone wants to break their "text" into
chippy-choppy-parts and neither words, nor tokens, nor
lexemes? and even if their parts could be called one of
these in some context they do not want to know?


> If we needed a new function, it seems the name
> "tokenize/2" might have been an easier mental adjustment.

Oh, the dark side is not stronger, but easier, more
seductive ... In a functional language I want names to
describe the result, not the action producing that result.
Imperative send() and fwrite() are fine because they are no
functions (mapping args to result) but mere procedures that
happen to return something more or less useful. No wonder
that send() is Prolog's exclamation mark: no backtracking
across a message sent.


> But anyway, naming things is hard

Actually it is *much* too easy, but finding *good*
names, good in all, or at least the most important
contexts ...
But I am sure that is what you mean.


> unicode enhancements are a big enough deal that I could
> *almost* care less what they are called.

As I have buried in my much ignored "Help" posting:
The complexity saved in the small, unimportant (each on its
own) but numerous details frees capacity for complexity on
the higher levels. So "almost" could be too little ...


> That said, who isn't going to open a new language's
> string lib and expect to find things called "split"

*I* do not want other languages in Erlang, lest someone
says "Hej, look at this great PHP lib ...". Of course,
Erlang cannot be a completely new kind of wheel, but I
wonder whether it is running faster than it can
without stumbling, recently, unnecessarily(?) trying to
catch up to something ...

~Michael

--

Reasonable is that which cannot be criticised reasonably.

Re: string:lexeme/s2 - an old man's rant

Richard O'Keefe
In reply to this post by empro2
Do not eat a lot of Chinese Gooseberries (the fruit whose name was changed by
the company Turners and Growers for marketing reasons, back before Ping-Pong
Diplomacy) unless you want to lose weight by running frequently to the small
room.  Actinidia deliciosa wants its fruit eaten but not fully digested, for
the sake of the seeds.
explains the name change.  I still use the old name.)


To be honest, the first time I saw the function name 'tokens', I expected
something returning *Erlang* tokens, then when I saw 'lexemes', I said to
myself "NOW they have a function that does what I thought tokens did".
Wrong again.

Here's an apropos example from the ANSI Smalltalk standard.

aString subStrings: separators

The first thing to note is that this actually violates Smalltalk naming
conventions: internal capitals are only to be used at *word* boundaries,
not *morpheme* boundaries.  And 'sub-' as used here is a prefix, not a
word.  Some Smalltalk implementations have changed it to 'substrings'.
One has renamed it to
aString asCollectionOfSubstringsSeparatedByAnyOf: separators

The second thing to note is that the description given for it is
hopelessly vague.  If anyone thinks that the Erlang library documentation
needs improving -- as I do -- at least it isn't an actual *standard*!
This operation does pretty much the same thing as string:tokens/2, but
you would never guess it from the text in the standard.  Oh, did I
mention that the Erlang documentation is free but the ANSI Smalltalk
standard is not?  The Erlang documentation is definitely value for money.

One of these days I must really ask to be allowed to edit some of the
Erlang documentation, but I'm afraid that if I do people will discover
that I'm better at criticising than writing.



Re: string:lexeme/s2 - an old man's rant

empro2
In reply to this post by Richard O'Keefe
On Wed, 8 May 2019 10:53:25 +1200
"Richard O'Keefe" <[hidden email]> wrote:

> For what it's worth, in Unicode, Line Separator and
> Paragraph Separator are the recommended characters, with
> CR, LF, CR+LF, and arguably NEL (U+0085) being
> "legacy".

Does that matter in a function not called `uc_lines` or
such?


> Again for what it's worth, Unicode defines an algorithm
> for breaking text into word( token)s.

"a=1&b=2&c=me+tomorrow"

"b=2" is no word, would UC call that a "token"? and if so,
would or should that matter to the user?
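
Whatever the pieces are called, chopping that query string takes two levels of splitting: first at "&", then each piece at the first "=". A minimal sketch, assuming string:split/2,3 from the current string module; the module name query_pairs is hypothetical and no URL decoding is attempted (the "+" stays a "+").

```erlang
-module(query_pairs).
-export([parse/1]).

%% First level: chop the query at every "&".
parse(Query) ->
    [pair(P) || P <- string:split(Query, "&", all)].

%% Second level: chop each piece at the first "=" only,
%% so a value may itself contain "=".
pair(P) ->
    case string:split(P, "=") of
        [Key, Value] -> {Key, Value};
        [Key] -> {Key, ""}
    end.
```

query_pairs:parse("a=1&b=2&c=me+tomorrow") returns [{"a","1"},{"b","2"},{"c","me+tomorrow"}].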

I would say that UC *is* an algorithm and no mere encoding
anymore. My impression is it has taken some wrong turns and
is now rolling down some strange hill driven by its mere
weight. It seems to push any available metadata onto each
character and then disunifies them in a way that makes
every single character reflect half of its context without
ever asking: Who is ever going to enter all these
correctly? even if the glyphs for those in a font happen to
be distinct. In this context I often picture a Norwegian
professor at the blackboard writing in Norsk (ø, Ø) about
empty sets (∅) and average (⌀) diameters (⌀) .... And then
again it does not, as 'average' and 'diameter' are the
same ...

And combining stuff requiring "canonisation"(?) and
allowing funny things like "◌̈ø" and utter rubbish "◌̈å" ...

And please tell the Maaori to replace "wh" with "f", "ng"
with "g" and, as some already do, macrons with double
vocals. ;-)

Simplify, simplify, simplify; things get complex more than
enough all on their own ... no, wait, things have already
got ...

~Michael

--

That which was said, is not that which was spoken,
but that which was understood; and none of these
comes necessarily close to that which was meant.

Re: string:lexeme/s2 - an old man's rant

Richard O'Keefe
In reply to this post by zxq9-2
Who isn't going to expect 'split' 'tokenize'/'tokens' 'clean' 'right' 'left' 'pad'?
A Ruby programmer will recognise 'split' from that list but nothing else.
An SML programmer will recognise 'tokens' but nothing else.
A Haskell programmer will wonder whether left/right correspond to justifyLeft/
  justifyRight and if so, which way around.  'split' might be OK, if only I knew
  what you expect it to do.
An F# or C# programmer will wonder whether left/right correspond to padLeft/
  padRight and if so, which way around.  As for 'split', which of the 10
  methods by that name did you have in mind?
A PL/I programmer will expect 'left' and 'right' to correspond to LEFT and RIGHT
  (or possibly the other way around, depending on whether the focus is where the
  *string* goes or where the *padding* goes).  The others will be a complete mystery.
A Simula programmer won't have a clue what any of these are and will be disappointed
  by strings that don't have a movable cursor.
An OCaml programmer will hope that 'split' is related to 'split_on_char' but will
  not have any idea what the other functions are.
A Python programmer may be surprised that 'split' is actually 'tokens'.

And so it goes.  What *does* 'clean' do?

Re: string:lexeme/s2 - an old man's rant

empro2
In reply to this post by Richard O'Keefe
On Thu, 9 May 2019 23:34:00 +1200
"Richard O'Keefe" <[hidden email]> wrote:

> To be honest, the first time I saw the function name
> 'tokens', I expected something returning *Erlang* tokens

That is what I meant by "foisting meaning onto substrings".


> aString subStrings: separators
>
> The first thing to note is that this actually violates
> Smalltalk naming conventions:

First thing I noted was: they also allow more than one
separator at once. Against a performance argument I weigh
simplicity, all the more so if separating substrings
would allow interdependencies (a simple hint at list
order might suffice to prevent such).


> 'substrings'. One has renamed it to aString
> asCollectionOfSubstringsSeparatedByAnyOf: separators

Reminds me of my contextfreely-descriptive names of 20
years ago ...


> One of these days I must really ask to be allowed to edit
> some of the Erlang documentation, but I'm afraid that if
> I do people will discover that I'm better at criticising
> than writing.

That is why I would like to bring in the host of fools like
myself, who do not yet "know what is meant" but have
workforce in numbers. We have to read it anyway and only at
this point we remember what is "useful" and the questions
that come up; but I think it unlikely people will go dig up
source and push pull forget what they were doing. I cannot
work that way (possibly lack of git routine), and I cannot
stand to ignore all those opportunities to improve
something good. So I cannot get on with my rubbish code,
hit the git-index-stage-cached-confusion, which starts
all this over again, and ... forget what I wanted to
do ... :-)

~Michael

--

Satire is the opium of the intellectual.

Re: string:lexeme/s2 - an old man's rant

Lloyd R. Prentice-2
Hi All,

In the grumpy moment when I launched this thread I never expected so many interesting insights and comments.  Now I want to step back and convey boundless gratitude to all who have worked so hard to develop and document Erlang. Thank you.

It’s clear that Erlang, the language, and official documentation are living entities— subject to changes that risk breaking legacy code and programmer expectations and mind set.

As I understand it, string processing was not so important in telecom applications. But today we see Erlang in chat and web applications where processing natural language is  fundamental.  So pragmatic questions:

— Is the string library sufficiently up to the task in these new domains?
— Can it be improved?
— If so, how?
— Is there an official process through which these questions can be answered and targeted improvements brought to pass?

All the best,

Lloyd

Sent from my iPad
