byte() vs. char() use in documentation


byte() vs. char() use in documentation

Kostis Sagonas-2
In the Erlang documentation, the language of types and specs makes a
clear distinction between the following two types:

     byte() :: 0..255
     char() :: 0..16#10ffff

See http://erlang.org/doc/reference_manual/typespec.html#id72693

I think that nowadays there are very good reasons to have this distinction.


In trying to fix a bug today, I happened to notice that some key types
of Erlang are inconsistent with this view in the Erlang/OTP
documentation (In http://erlang.org/doc/man/erlang.html), most notably:

     iolist() :: [char() | binary() | iolist()]

   binary_to_list(Binary) -> [char()]
   binary_to_list(Binary, Start, Stop) -> [char()]
   bitstring_to_list(Bitstring) -> [char()|bitstring()]

and:

     BitstringList :: [BitstringList | bitstring() | char()]

which actually triggered this mail.

I think all the occurrences of char() above should read byte() instead.
Right?
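
For example, a binary holding the three UTF-8 bytes of U+212B (Å, the
Angstrom sign) comes back as exactly those bytes:

     1> binary_to_list(<<226,132,171>>).
     [226,132,171]

Every element is a byte(), i.e. 0..255, regardless of what the bytes encode,
so [byte()] is the precise return type.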

If yes, could somebody at OTP (or some kind volunteer) please clean up
this mess?  (I can provide a fix for the documentation of the 'erlang'
module if you want me to.)

Kostis

Re: byte() vs. char() use in documentation

James Churchman
more of a question than an actual answer, but in Erlang can strings (and therefore io-lists) be UTF-16?

I assume that binaries are obviously only ever a UTF-8 representation, but a list of ints can obviously contain numbers above 255..

so maybe (??) the answer is

a) an iolist CAN contain char() (.. this is surely especially true if the data is only being passed through Erlang from other systems)

b) the binary_to_list cases are a bit less easy

basically it can't be char(), because it will always have started off as an 8-bit (UTF-8) representation, so it will always come back as a list of byte(); but in the general case it's returning an io-list, and that can contain char()

is this correct? and in that case does that make the BIFs' XML doc file in fact correct?

James


Re: byte() vs. char() use in documentation

Masklinn
On 2 mai 2011, at 01:01, James Churchman <[hidden email]> wrote:
> more of a question than an actual answer, but in erlang can erlang strings ( therefore io-lists) be utf-16?
>
The 16#10ffff upper bound indicates iolists are likely encoded in UCS-4.

Re: byte() vs. char() use in documentation

Raimo Niskanen-2
In reply to this post by James Churchman
On Mon, May 02, 2011 at 12:01:49AM +0100, James Churchman wrote:
> more of a question than an actual answer, but in erlang can erlang strings ( therefore io-lists) be utf-16?

A string is a list of unicode code points.

An IO-list is a list of binaries or bytes.

>
> I assume that binaries are obviously only ever utf8 representation, but a list of ints can obviously exceed number above 255..

You can choose your binary representation. See erlang man page unicode(3).

>
> so maybe (??) the answer is
>
> a) iolist CAN be a  char() (.. this is surely especially true if the data is only being messages threw erlang from other systems)

No. byte().

>
> b) the binary to list are a bit less easy

Compare erlang:binary_to_list/1 and erlang:list_to_binary/1 with the
corresponding functions in module 'unicode'.
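
For example, assuming the binary holds UTF-8 data (byte values shown
numerically; the shell may render printable lists and binaries as strings):

    1> binary_to_list(<<206,177>>).                    % the raw bytes
    [206,177]
    2> unicode:characters_to_list(<<206,177>>, utf8).  % decoded to a code point
    [945]
    3> unicode:characters_to_binary([945]).            % and back to UTF-8 bytes
    <<206,177>>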

>
> basically it can't be a char(), because it will always have started off as an 8bit ( utf8 ) representation so it will always come back as a list of byte() but in the general case, it's returning an io-list and that can be a char()
>
> is this correct? and in that case does that make the bif's xml doc file in fact correct?

The documentation is incorrect. Once there was no difference between char()
and byte(): char() meant an ISO-8859-1 character, which is the same
size as byte().



--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

byte() vs. char() use in documentation

Robert Virding-5

----- "Raimo Niskanen" <raimo+erlang-questions> wrote:

> On Mon, May 02, 2011 at 12:01:49AM +0100, James Churchman wrote:
> > more of a question than an actual answer, but in erlang can erlang
> strings ( therefore io-lists) be utf-16?
>
> A string is a list of unicode code points.
>
> An IO-list is a list of binaries or bytes.
>
> >
> > I assume that binaries are obviously only ever utf8 representation,
> but a list of ints can obviously exceed number above 255..
>
> You can choose your binary representation. See erlang man page
> unicode(3).
>
> >
> > so maybe (??) the answer is
> >
> > a) iolist CAN be a  char() (.. this is surely especially true if the
> data is only being messages threw erlang from other systems)
>
> No. byte().

As a string is a list of unicode code points and an iolist can contain a string then its type must also be char().

Robert



Re: byte() vs. char() use in documentation

Raimo Niskanen-2
On Mon, May 02, 2011 at 09:35:18AM +0000, Robert Virding wrote:

>
> ----- "Raimo Niskanen" <[hidden email]> wrote:
>
> > On Mon, May 02, 2011 at 12:01:49AM +0100, James Churchman wrote:
> > > more of a question than an actual answer, but in erlang can erlang
> > strings ( therefore io-lists) be utf-16?
> >
> > A string is a list of unicode code points.
> >
> > An IO-list is a list of binaries or bytes.
> >
> > >
> > > I assume that binaries are obviously only ever utf8 representation,
> > but a list of ints can obviously exceed number above 255..
> >
> > You can choose your binary representation. See erlang man page
> > unicode(3).
> >
> > >
> > > so maybe (??) the answer is
> > >
> > > a) iolist CAN be a  char() (.. this is surely especially true if the
> > data is only being messages threw erlang from other systems)
> >
> > No. byte().
>
> As a string is a list of unicode code points and an iolist can contain a string then its type must also be char().

No. As it stands now a string is a list of unicode code points and
cannot be contained in an iolist.

This became messy when char() was re-defined from latin-1 character
to unicode character. That affected string(), which affected iolist(),
and the latter became incorrect.

We must clean up the mess. Either by completing the notion of char()
being unicode and hence rewriting iolist() to contain byte() and binary(),
or by reverting to char() being latin-1 char and using unicode:char()
and unicode:string() where that is correct...
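
For the first alternative, the specs in question would then read roughly
like this (a sketch only, not a final wording):

     iolist() :: [byte() | binary() | iolist()]

   binary_to_list(Binary) -> [byte()]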

>
> Robert

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Re: byte() vs. char() use in documentation

Kostis Sagonas-2
Raimo Niskanen wrote:
>
> This became messy when char() was re-defined from latin-1 character
> to unicode character. That affected string(), which affected iolist(),
> and the latter became incorrect.
>
> We must clean up the mess.

Right.  The sooner it happens the better it is.

> ... Either by completing the notion of char()
> being unicode and hence rewriting iolist() to contain byte() and binary(),
> or by reverting to char() being latin-1 char and using unicode:char()
> and unicode:string() where that is correct...

Please, by all means do the former.  The latter will only cause havoc
everywhere.  For starters, I do not see any need in having two different
basic types (byte() and char()) denoting (pretty much) the same thing.
The only thing this does is cause unnecessary confusion to newcomers
(and apparently to some old-timers too).  Second, if you choose the
latter you will eventually have to change lots of type inference code,
because I promise you I will not do this, and believe me you don't want
to go there... (The Vietnam jungle is probably a friendlier place ;) )

Cheers,
Kostis

Re: byte() vs. char() use in documentation

James Churchman
So just for my own understanding, and as it seems extremely important (strings are quite important these days!), as it stands now:

iolists can only (officially?) contain UTF-8? (as no UTF-8 code point will exceed 255, like latin-1 / ASCII, and so they are all byte())

strings can be of UTF-8, UTF-16 or UTF-32, but only the UTF-8 version is allowed in an iolist? (and therefore if you wanted an "iolist" (e.g. a non-flat list of chars) that contained UTF-16 or UTF-32 code points you would have to stick exclusively to lists (strings) and not binaries, and use lists:flatten before you finished with it, to remove all the nested lists)

binaries can be of any unicode type..

also there does seem to be a needed distinction between char() and byte(), as they are not the same at all, but the documentation is wrong, as at the moment iolists can in fact only contain byte(), not char()

the suggested direction is to repair the docs so that they specify only allowing 0..255 ints (byte()) in iolists, rather than allowing io-lists to contain any string as they did before the introduction of unicode / in the days of latin-1 etc..?


i think that goes against most people's (even erlang implementers' :-)) opinion of what an iolist is (that being a list of any valid string or binary), but maybe (to raise a totally different problem) it would prevent the possibility of an iolist having a mixed unicode type and still being "valid" (even though I guess this is still possible, as binaries can in fact be other UTF representations)




Re: byte() vs. char() use in documentation

Richard A. O'Keefe-2

On 3/05/2011, at 7:43 AM, James Churchman wrote:
> strings can be of utf8 utf16 or utf32,

No.  The model for strings is "one list element = one unicode character",
and both UTF-8 and UTF-16 violate that.

A list of ASCII code-points is both a (Unicode) string and an iolist.
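
For example (error output abbreviated):

    1> iolist_to_binary("ABC").      % [65,66,67]: a string and a valid iolist
    <<"ABC">>
    2> iolist_to_binary([945,946]).  % Greek alpha-beta as code points: a string, but not an iolist
    ** exception error: bad argument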

Of course, nothing stops you holding an abstract string as a list of
octets using UTF-8 (or for that matter, UTF-EBCDIC) or as a list of
16-bit units using UTF-16.  It's just that if you do so, what you
have doesn't count as an Erlang string any more (outside ASCII).

> also there does seem to be a needed distinction between char() and byte() as they are not the same at all, but the documentation is wrong as at the moment iolists can infact only contain byte() not char()

yes.



Re: byte() vs. char() use in documentation

Raimo Niskanen-2
In reply to this post by James Churchman
On Mon, May 02, 2011 at 08:43:33PM +0100, James Churchman wrote:
> So just for my own understanding, and as it seems extremely important (
> strings are quite important these days!), as it stands now:
>
> iolists cant can only ( officially?) contain utf8? ( as no utf8 code point
> will exceed  255, like latin1 / asci, and are therefor are all byte() )

Richard O'Keefe explained this nicely, I'll just elaborate.

iolists were introduced for handling byte sequences without having
to copy them, just building them in nested form, either from
individual bytes or from binaries.

Back then characters were only latin-1, hence they matched bytes nicely
and therefore iolists. This is no longer true. Now you will
have to do a translation from a sequence of characters into the
corresponding byte sequence in an iolist. The preferred
representation in Erlang is UTF-8 since it is the default
for e.g. the unicode module and for the ~t modifier in the
io module when printing strings.
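
For example (byte values shown numerically; the shell may render the
binary differently):

    1> unicode:characters_to_binary([229,228,246]).   % "åäö" as code points
    <<195,165,195,164,195,182>>                       % six UTF-8 bytes, fine in an iolist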

>
> strings can be of utf8 utf16 or utf32, but only the utf8 version is allowed

The programmer should regard strings as a sequence of unicode code points.
As such they are just that and there is no encoding to bother about.
The code point number uniquely defines which unicode character it is.

> in an iolist? ( and therefore if you wanted an "iolist" ( eg a non flat list

UTF-8 is not the only encoding allowed in an iolist. You can use any
encoding you desire. If you use the unicode module the default format for
encoding and decoding of binaries is UTF-8, but UTF-16 or UTF-32, big or
little endian, is easy to do. An iolist is just a sequence of bytes.

> of chars) that contained utf 16 or 32 code points you would have to stick

An iolist is a non-flat list of bytes. Do not mix up bytes with characters.

> exclusively to lists ( strings) and not binaries and use lists:flatten

You cannot just use lists:flatten on a unicode character string to get
an iolist. The Unicode code points > 255 are still there. You will
have to encode the unicode characters into a suitable byte representation,
e.g. using the unicode module.

> before you finished with it, to remove all the nested lists )
>
> binaries can be of any unicode type..

Binaries are sequences of bytes. Period. You decide what they mean.

>
> also there does seem to be a needed distinction between char() and byte() as
> they are not the same at all, but the documentation is wrong as at the
> moment iolists can infact only contain byte() not char()

Yes.

>
> the suggested direction is to repair the docs so that they specify only
> allowing 0~255 ints( byte() ) in iolists rather than allowing io-lists to
> contain any string as they did before the introduction of unicode / in the
> days of latin1 etc.. ?

Yes. That iolists could contain any string was by accident, since there
were no characters > 255 in the days of latin-1. Since iolists are about
sequences of bytes they cannot be fixed into being allowed to contain
any character. For that to be possible you would have to define the
byte encoding for iolists, or store the byte encoding with a particular
iolist. Since there are so many byte encodings in use it is
better to make this visible to the programmer so he/she is forced to
understand the byte encoding problem and to handle it explicitly.
Therefore iolists now are, as they always were, sequences of bytes (8-bit).
And that is all.

>
>
> i think that that goes agents most ( even erlang implementers :-) ) opinion
> of what an iolist is ( that being a list of any valid string or binary) but

I think not.

An iolist is any valid byte or binary sequence. Binaries are sequences
of bytes. They are all about bytes.

Characters and strings are today vastly more complex beasts than they
were when US-ASCII and later ISO-LATIN-1 was the norm. This must
be visible to the programmer.

> maybe ( to raise a totally different problem) would prevent the possibility
> of an iolist having a mixed unicode type and still begin "valid" ( even tho
> i guess this is still possible as binaries can in fact be other utf
> representations)

I repeat again. The programmer decides what the bytes mean. The list
[0,0,16#21,16#2b] e.g would mean "angstrom sign" if the encoding is
UTF-32 big endian. And that is a valid iolist.
But [16#212b] is not.
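
In the shell (error output abbreviated):

    1> iolist_to_binary([0,0,16#21,16#2b]).                     % a valid iolist: four bytes
    <<0,0,33,43>>
    2> unicode:characters_to_list(<<0,0,33,43>>, {utf32,big}).  % decoding those bytes
    [8491]                                                      % 8491 = 16#212b
    3> iolist_to_binary([16#212b]).                             % not a valid iolist
    ** exception error: bad argument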



--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Re: byte() vs. char() use in documentation

Anthony Shipman
On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
> The programmer should regard strings as a sequence of unicode code points.
> As such they are just that and there is no encoding to bother about.
> The code point number uniquely defines which unicode character it is.

As I recall, a Unicode character can be composed of up to 7 code points.
To quote a text book I'm looking at now:
-------------
The trick is, again, to disabuse yourself of the idea that a one-to-one
correspondence exists between "characters" as the user is used to thinking of
them and code points (or code units) in the backing store. Unicode uses the
term "character" to mean more or less "the entity that's represented by a
single Unicode code point," but this concept doesn't always match the user's
definition of "character".
-------------

I think a more complete design would represent a character as a binary that is
a UTF8 encoding of its code points. A string would then be a deep list of
these binaries.

--
Anthony Shipman                    Mamas don't let your babies
[hidden email]                   grow up to be outsourced.

Re: byte() vs. char() use in documentation

dmercer
In reply to this post by Raimo Niskanen-2
On Tuesday, May 03, 2011, Raimo Niskanen wrote:

> I repeat again. The programmer decides what the bytes mean. The list
> [0,0,16#21,16#2b] e.g would mean "angstrom sign" if the encoding is
> UTF-32 big endian. And that is a valid iolist.
> But [16#212b] is not.

Out of curiosity, why does

        unicode:characters_to_binary([16#212b], {utf32, big}).

return the UTF-8 representation of Å (Angstrom sign) and not the big-endian
UTF-32 like I expected?


Re: byte() vs. char() use in documentation

dmercer
In reply to this post by Anthony Shipman
On Tuesday, May 03, 2011, Anthony Shipman wrote:

> I think a more complete design would represent a character as a binary
> that is
> a UTF8 encoding of its code points. A string would then be a deep list
> of
> these binaries.

How is that superior to representing a character by a single integer
for its Unicode code point, and a string by a list of characters?  You
can always use unicode:characters_to_binary/1 to convert to a UTF-8 binary
if you wish.
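
For example:

    1> unicode:characters_to_binary([16#212b]).
    <<226,132,171>>                              % UTF-8 for the Angstrom sign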


Re: byte() vs. char() use in documentation

Raimo Niskanen-2
In reply to this post by Anthony Shipman
On Wed, May 04, 2011 at 05:33:58AM +1000, Anthony Shipman wrote:

> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
> > The programmer should regard strings as a sequence of unicode code points.
> > As such they are just that and there is no encoding to bother about.
> > The code point number uniquely defines which unicode character it is.
>
> As I recall, a Unicode character can be composed of up to 7 code points.
> To quote a text book I'm looking at now:
> -------------
> The trick is, again, to disabuse yourself of the idea that a one-to-one
> correspondence exists between "characters" as the user is used to thinking of
> them and code points (or code units) in the backing store. Unicode uses the
> term "character" to mean more or less "the entity that's represented by a
> single Unicode code point," but this concept doesn't always match the user's
> definition of "character".
> -------------

There seems to be a terminology clash here that I will remember for the future.
When I talked about "Unicode code points" I meant the character number
in the Unicode system. I did not think it was allowed to talk about "code points"
when talking about byte-encoded data. There are text books that talk about
"code points (or code units) in the backing store". I find that very confusing.
I will always call it "byte encoding" or something like that.


--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Re: byte() vs. char() use in documentation

Raimo Niskanen-2
In reply to this post by dmercer
On Tue, May 03, 2011 at 03:05:24PM -0500, David Mercer wrote:

> On Tuesday, May 03, 2011, Raimo Niskanen wrote:
>
> > I repeat again. The programmer decides what the bytes mean. The list
> > [0,0,16#21,16#2b] e.g would mean "angstrom sign" if the encoding is
> > UTF-32 big endian. And that is a valid iolist.
> > But [16#212b] is not.
>
> Out of curiosity, why does
>
> unicode:characters_to_binary([16#212b], {utf32, big}).

Man page says:
        characters_to_binary(Data,InEncoding) -> binary() | ...

You changed the InEncoding to {utf32,big}.
You want this:
        characters_to_binary(Data, InEncoding, OutEncoding) -> binary() | ...

1> unicode:characters_to_binary([16#212b], unicode, {utf32, big}).
<<0,0,33,43>>

>
> return the UTF-8 representation of Å (Angstrom sign) and not the big-endian
> UTF-32 like I expected?

InEncoding only applies to binaries in the indata since integers
are just Unicode code points and have no encoding:

2> unicode:characters_to_binary([16#212b,<<226,132,171>>], utf8, {utf32, big}).
<<0,0,33,43,0,0,33,43>>

        Note: unicode is an alias for utf8 in the unicode module
              since utf8 is the default encoding

It is all in the Erlang man page for unicode(3).

--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Re: byte() vs. char() use in documentation

Masklinn
In reply to this post by Raimo Niskanen-2

On 2011-05-04, at 09:57 , Raimo Niskanen wrote:

> On Wed, May 04, 2011 at 05:33:58AM +1000, Anthony Shipman wrote:
>> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>>> The programmer should regard strings as a sequence of unicode code points.
>>> As such they are just that and there is no encoding to bother about.
>>> The code point number uniquely defines which unicode character it is.
>>
>> As I recall, a Unicode character can be composed of up to 7 code points.
>> To quote a text book I'm looking at now:
>> -------------
>> The trick is, again, to disabuse yourself of the idea that a one-to-one
>> correspondence exists between "characters" as the user is used to thinking of
>> them and code points (or code units) in the backing store. Unicode uses the
>> term "character" to mean more or less "the entity that's represented by a
>> single Unicode code point," but this concept doesn't always match the user's
>> definition of "character".
>> -------------
>
> There seems to be a terminology here clash that I will remember for the future.
> When I talked about "Unicode code points" I ment the character number
> in the Unicode system. I did not think it was allowed to talk about "code points"
> when talking about byte encoded data.
Well, code points are abstract numbers, but UTF-32 (as far as I know) encodes the
code points as themselves. So many people make the shortcut (furthermore most
people aren't really interested in understanding Unicode — and I can understand
that, it's a drag — so they mix unicode-lingo with "normal" speech leading to
less-than-sensical results).
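
This is easy to see with the bit syntax, where utf32 defaults to big-endian:

    1> <<16#212b/utf32>>.
    <<0,0,33,43>>          % the code point number itself, as four big-endian bytes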

I believe the issue Anthony mentions here is the difference between glyphs and code
points (combining marks) rather than the difference between code points and
on-disk bytes (resulting from Unicode encoding): a "visible character" (e.g. į̇́)
can be composed of multiple code points, one "base" code point and a number of
combining marks code points (diacritics being the main offender) (nb: the glyph
"į̇́" is, in fact, composed of three code points: U+012F, U+0307 and U+0301).

What most users think of as a character is what unicode calls a glyph: it's the
graphical representation of a group of combined code points (that group may be
unary). Whereas in Unicode, a character is the graphical representation of a
single code point. As a result, a "user" character may be composed of a number
of "unicode" characters.
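
In Erlang terms, the three-code-point glyph above is simply a three-element
string:

    1> Glyph = [16#012F, 16#0307, 16#0301].
    [303,775,769]
    2> length(Glyph).
    3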


Re: byte() vs. char() use in documentation

Anthony Shipman
In reply to this post by dmercer
On Wed, 4 May 2011 06:05:24 am David Mercer wrote:

> On Tuesday, May 03, 2011, Anthony Shipman wrote:
> > I think a more complete design would represent a character as a binary
> > that is
> > a UTF8 encoding of its code points. A string would then be a deep list
> > of
> > these binaries.
>
> How is that superior than representing a character by a single integer
> representing the Unicode codepoint, a string by a list of characters?  You
> can always use unicode:characters_to_binary/1 to convert to a UTF-8 binary
> if you wish.

What we think of as a character, e.g. some letter on a page, can be a
combination of a base component and some combining components. (I use the
word component since I'm not quite sure at the moment exactly what a glyph
means. A component is represented by a code point). Combining components
include accents and a variety of other marks that some languages attach to
the base component.  For example in French the "e-acute" could be represented
as a single code point or as a pair of the code points for "e" and "acute
accent". The standard puts some effort into defining a canonical
representation so that it isn't a total nightmare to tell if two characters
are the same. You have to convert a Unicode string to its canonical form
before you can test for equality.

To fully implement the intent of Unicode we need to talk in terms of
characters, i.e. something you may insert or delete in a word processor,
which may themselves be a sequence of code points which are kept together.

--
Anthony Shipman                    Mamas don't let your babies
[hidden email]                   grow up to be outsourced.

Re: byte() vs. char() use in documentation

Richard A. O'Keefe-2
In reply to this post by Anthony Shipman

On 4/05/2011, at 7:33 AM, Anthony Shipman wrote:

> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>> The programmer should regard strings as a sequence of unicode code points.
>> As such they are just that and there is no encoding to bother about.
>> The code point number uniquely defines which unicode character it is.
>
> As I recall, a Unicode character can be composed of up to 7 code points.

I find this rather confusing.
Here are some official definitions from the Unicode standard:

  Character.
    (1) The smallest component of written language that has semantic value;
        refers to the abstract meaning and/or shape, rather than a specific
        shape (see also glyph), though in code tables some form of visual
        representation is essential for the reader’s understanding.
    (2) Synonym for abstract character.
    (3) The basic unit of encoding for the Unicode character encoding.
    (4) The English name for the ideographic written elements of Chinese origin.
        [See ideograph (2).]

  Coded Character Set.
    A character set in which each character is assigned a numeric code point.
    Frequently abbreviated as character set, charset, or code set;
    the acronym CCS is also used.

  Code Point.
    (1) Any value in the Unicode codespace; that is, the range of integers
        from 0 to 10FFFF(base 16).
        (See definition D10 in Section 3.4, Characters and Encoding.)
    (2) A value, or position, for a character, in any coded character set.

  Code Unit.
    The minimal bit combination that can represent a unit of encoded text
    for processing or interchange.  The Unicode Standard uses 8-bit code units
    in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form,
    and 32-bit code units in the UTF-32 encoding form.
    (See definition D77 in Section 3.9, Unicode Encoding Forms.)

Each Unicode character has *BY DEFINITION* precisely *ONE* code point.
A code point is a number in the range 0 to 1,114,111.

The largest legal Unicode code point (hex 10FFFF) requires precisely
FOUR code units:

11110100 10001111 10111111 10111111      
-----  3 --     6 --     6 --     6

The 11110 prefix on the leading byte says "here are four bytes";
the "10" prefixes on the remaining bytes say "here are 6 more bits".

No Unicode code point requires more than four code units.
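
The bit syntax agrees:

    1> <<16#10FFFF/utf8>>.
    <<244,143,191,191>>     % i.e. 2#11110100, 2#10001111, 2#10111111, 2#10111111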

> To quote a text book I'm looking at now:
> -------------
> The trick is, again, to disabuse yourself of the idea that a one-to-one
> correspondence exists between "characters" as the user is used to thinking of
> them and code points (or code units) in the backing store.

Sorry, it looks as though you need a better text book.
Code points and code units are NOT the same thing (at least for UTF-8 and
UTF-16).

There IS, by definition, a direct correspondence between Unicode characters
and code points (not every code point has been assigned a character yet).

> Unicode uses the
> term "character" to mean more or less "the entity that's represented by a
> single Unicode code point," but this concept doesn't always match the user's
> definition of "character".

And _that_ is talking about two other issues:
(1) Unicode classifies code points as Graphic, Format, Control, Private-Use,
    Surrogate, Noncharacter, or Reserved.  Only the Graphic characters are
    ones that users are likely to think of as characters.
(2) Things that the user thinks of as a character (like é) may be represented
    by sequences of code points, called Grapheme Clusters, consisting of a
    base character and some nonspacing marks.  This has nothing to do with
    encodings.

> I think a more complete design would represent a character as a binary that is
> a UTF8 encoding of its code points. A string would then be a deep list of
> these binaries.

Once again, a Unicode character has *by definition* one code point;
and from a storage point of view, it's pretty silly to use a big thing like
a binary to represent a 21-bit integer.

The main principle to understand about Unicode is to *always* think in terms of
strings, not of characters.


Re: byte() vs. char() use in documentation

Masklinn
On 2011-05-05, at 02:03 , Richard O'Keefe wrote:

> On 4/05/2011, at 7:33 AM, Anthony Shipman wrote:
>
>> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>>> The programmer should regard strings as a sequence of unicode code points.
>>> As such they are just that and there is no encoding to bother about.
>>> The code point number uniquely defines which unicode character it is.
>>
>> As I recall, a Unicode character can be composed of up to 7 code points.
>
> I find this rather confusing.
> Here are some official definitions from the Unicode standard:
>
>  Character.
>    (1) The smallest component of written language that has semantic value;
>        refers to the abstract meaning and/or shape, rather than a specific
>        shape (see also glyph), though in code tables some form of visual
>        representation is essential for the reader’s understanding.
>    (2) Synonym for abstract character.
>    (3) The basic unit of encoding for the Unicode character encoding.
>    (4) The English name for the ideographic written elements of Chinese origin.
>        [See ideograph (2).]
>
>  Coded Character Set.
>    A character set in which each character is assigned a numeric code point.
>    Frequently abbreviated as character set, charset, or code set;
>    the acronym CCS is also used.
>
>  Code Point.
>    (1) Any value in the Unicode codespace; that is, the range of integers
>        from 0 to 10FFFF(base 16).
>        (See definition D10 in Section 3.4, Characters and Encoding.)
>    (2) A value, or position, for a character, in any coded character set.
>
>  Code Unit.
>    The minimal bit combination that can represent a unit of encoded text
>    for processing or interchange.  The Unicode Standard uses 8-bit code units
>    in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form,
>    and 32-bit code units in the UTF-32 encoding form.
>    (See definition D77 in Section 3.9, Unicode Encoding Forms.)
>
> Each Unicode character has *BY DEFINITION* precisely *ONE* code point.
> A code point is a number in the range 0 to 1,114,111.
>
> The largest legal Unicode code point (hex 10FFFF) requires precisely
> FOUR code units:
>
> 11110100 10001111 10111111 10111111      
> -----  3 --     6 --     6 --     6
>
> The 11110 prefix on the leading byte says "here are four bytes";
> the "10" prefixes on the remaining bytes say "here are 6 more bits".
UTF-8 makes allowance for full 31-bit code points (6 code units when encoded)
though, which may trip people up.

> No Unicode code point requires more than four code units.
For now.

> And _that_ is talking about two other issues:
I strongly disagree. I believe this is the *core* of the whole issue, and
*this* is the reason why people are confused: a complete mastery of the unicode
lingo (which the standard's definitions do not even provide; as your quote
shows, "Character" has 4 different definitions, most of which cannot be
understood by the layman) and a very good capacity to differentiate common
speech from unicode lingo are necessary to navigate unicode discussions correctly.

The vast majority of developers do *not* possess these (not necessarily for lack
of trying), and the differences in the status of (mostly) the word "character"
(which can be hard to understand from context) lead to a minefield of
misunderstanding and frustration.

I strongly believe it was a mistake for the Unicode consortium to use this word.


Re: byte() vs. char() use in documentation

Lionel Cons
In reply to this post by Raimo Niskanen-2
AFAIK, the confusion comes from two different uses of the term "character".

The "individual character" is at the heart of Unicode. Each individual
character maps to a unique code point. For instance, a lowercase alpha is
the character named "GREEK SMALL LETTER ALPHA" and maps to code point
U+03B1. The Unicode code points are between 0 and 0x10FFFF.

The "logical character" is what human beings usually have in mind. In the
real world, a text is a sequence of logical characters. An example of such
a character is the lowercase letter "e" with an acute accent.

Some logical characters do not map directly to individual characters and
must be represented as a combination of several individual characters (this
is called I think an "extended grapheme cluster").

Some logical characters do map to individual characters and can therefore
have two different representations in Unicode:
 - with an individual character
 - with a combination of several individual characters

For instance, our "e" with an acute accent can be represented as:
 - the individual character "LATIN SMALL LETTER E WITH ACUTE" (U+00E9)
or
 - the combination "LATIN SMALL LETTER E" (U+0065) plus "COMBINING ACUTE
   ACCENT" (U+0301)

To cope with this, Unicode defines the notions of canonical and compatible
equivalence (see http://en.wikipedia.org/wiki/Unicode_equivalence).
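
As plain code-point lists, the two representations of "e with acute" simply
compare unequal, which is why such a normalization (or equivalence) step is
needed before testing equality:

    1> [16#00E9] =:= [16#0065, 16#0301].
    false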

To come back to the point, we have to define what we mean with the Erlang
char() type:
 - if it's an individual character then it can naturally be represented as
   a single integer for its code point
 - if it's a logical character then it has to be a list of integers

In any case, the language must provide specific functions to work on strings
and characters. For instance, a logical character comparison must take into
account the Unicode equivalence.

Cheers,
__________________________________________________________
Lionel Cons        http://cern.ch/lionel.cons
CERN               http://cern.ch