|
In the Erlang documentation, the language of types and specs makes a clear distinction between the following two types:

  byte() :: 0..255
  char() :: 0..16#10ffff

See http://erlang.org/doc/reference_manual/typespec.html#id72693

I think that nowadays there are very good reasons to have this distinction.

In trying to fix a bug today, I happened to notice that some key types of Erlang are inconsistent with this view in the Erlang/OTP documentation (in http://erlang.org/doc/man/erlang.html), most notably:

  iolist() :: [char() | binary() | iolist()]

  binary_to_list(Binary) -> [char()]
  binary_to_list(Binary, Start, Stop) -> [char()]
  bitstring_to_list(Bitstring) -> [char()|bitstring()]

and:

  BitstringList :: [BitstringList | bitstring() | char()]

which actually triggered this mail.

I think all the occurrences of char() above should read byte() instead. Right?

If yes, could somebody at OTP (or some kind volunteer) please clean up this mess? (I can provide a fix for the documentation of the 'erlang' module if you want me to.)

Kostis

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions |
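The byte()/char() distinction raised above can be checked directly against the runtime. The following is a minimal sketch, not part of the original post: the module name `iolist_bytes` and the helper `is_valid_iolist/1` are hypothetical names chosen for illustration. It relies only on `erlang:iolist_to_binary/1`, which raises `badarg` when a list element is an integer outside 0..255.

```erlang
-module(iolist_bytes).
-export([is_valid_iolist/1]).

%% Hypothetical helper: returns true if Term is accepted by
%% erlang:iolist_to_binary/1 and false if the BIF raises badarg.
-spec is_valid_iolist(term()) -> boolean().
is_valid_iolist(Term) ->
    try erlang:iolist_to_binary(Term) of
        Bin when is_binary(Bin) -> true
    catch
        error:badarg -> false
    end.
```

With this helper, `is_valid_iolist([0, 255, <<"abc">>, [$d]])` returns `true`, while `is_valid_iolist([256])` and `is_valid_iolist([16#212b])` return `false`, which is consistent with list elements of an iolist being byte() rather than char().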
|
More of a question than an actual answer, but in Erlang, can strings (and therefore iolists) be UTF-16?

I assume that binaries only ever hold a UTF-8 representation, but a list of ints can obviously contain numbers above 255. So maybe (??) the answer is:

a) An iolist CAN contain a char() (this is surely especially true if the data is only being messaged through Erlang from other systems).

b) The binary-to-list functions are a bit less easy. Basically the result can't be a char(), because it will always have started off as an 8-bit (UTF-8) representation, so it will always come back as a list of byte(). But in the general case it's returning an iolist, and that can be a char().

Is this correct? And in that case, does that make the BIFs' XML doc file in fact correct?

James |
|
On 2 May 2011, at 01:01, James Churchman <[hidden email]> wrote:

> more of a question than an actual answer, but in erlang can erlang strings ( therefore io-lists) be utf-16?

The 16#10ffff upper bound indicates iolists are likely encoded in UCS-4. |
|
On Mon, May 02, 2011 at 12:01:49AM +0100, James Churchman wrote:

> more of a question than an actual answer, but in erlang can erlang strings ( therefore io-lists) be utf-16?

A string is a list of Unicode code points.

An iolist is a list of binaries or bytes.

> I assume that binaries are obviously only ever utf8 representation, but a list of ints can obviously exceed number above 255..

You can choose your binary representation. See the Erlang man page unicode(3).

> so maybe (??) the answer is
>
> a) iolist CAN be a char() (.. this is surely especially true if the data is only being messages threw erlang from other systems)

No. byte().

> b) the binary to list are a bit less easy

Compare erlang:binary_to_list/1 and erlang:list_to_binary/1 with the corresponding functions in module 'unicode'.

> basically it can't be a char(), because it will always have started off as an 8bit ( utf8 ) representation so it will always come back as a list of byte() but in the general case, it's returning an io-list and that can be a char()
>
> is this correct? and in that case does that make the bif's xml doc file in fact correct?

The documentation is incorrect. Once there was no difference between char() and byte(): char() meant an ISO-8859-1 character, which is the same size as byte().

--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB |
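The comparison suggested above, between the `erlang` BIFs and the functions in the `unicode` module, can be sketched as follows. The module name `bin_vs_unicode` and both function names are hypothetical, introduced only for illustration; the calls they wrap (`erlang:binary_to_list/1` and `unicode:characters_to_list/2`) are the standard ones.

```erlang
-module(bin_vs_unicode).
-export([as_bytes/1, as_code_points/1]).

%% One list element per byte of the binary, whatever the bytes mean.
-spec as_bytes(binary()) -> [byte()].
as_bytes(Bin) ->
    erlang:binary_to_list(Bin).

%% One list element per Unicode code point, assuming the binary
%% holds UTF-8. May also return an {error, ...} or {incomplete, ...}
%% tuple when the input is not valid UTF-8.
as_code_points(Bin) ->
    unicode:characters_to_list(Bin, utf8).
```

For the UTF-8 encoding of the angstrom sign, `as_bytes(<<226,132,171>>)` gives `[226,132,171]` (three byte() values), whereas `as_code_points(<<226,132,171>>)` gives `[8491]`, that is `[16#212b]`, a single char().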
|
On Mon, May 02, 2011 at 09:35:18AM +0000, Robert Virding wrote:

> ----- "Raimo Niskanen" <[hidden email]> wrote:
>
> > On Mon, May 02, 2011 at 12:01:49AM +0100, James Churchman wrote:
> > > more of a question than an actual answer, but in erlang can erlang
> > > strings ( therefore io-lists) be utf-16?
> >
> > A string is a list of unicode code points.
> >
> > An IO-list is a list of binaries or bytes.
> >
> > > I assume that binaries are obviously only ever utf8 representation,
> > > but a list of ints can obviously exceed number above 255..
> >
> > You can choose your binary representation. See erlang man page
> > unicode(3).
> >
> > > so maybe (??) the answer is
> > >
> > > a) iolist CAN be a char() (.. this is surely especially true if the
> > > data is only being messages threw erlang from other systems)
> >
> > No. byte().
>
> As a string is a list of unicode code points and an iolist can contain a string then its type must also be char().

No. As it stands now a string is a list of unicode code points and cannot be contained in an iolist.

This became messy when char() was re-defined from latin-1 character to unicode character. That affected string() that affected iolist() and the latter was incorrect.

We must clean up the mess. Either by completing the notion of char() being unicode and hence rewriting iolist() to contain byte() and binary(), or by reverting to char() being latin-1 char and using unicode:char() and unicode:string() where that is correct...

--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB |
|
Raimo Niskanen wrote:

> This became messy when char() was re-defined from latin-1 character
> to unicode character. That affected string() that affected iolist()
> and the latter was incorrect.
>
> We must clean up the mess.

Right. The sooner it happens the better it is.

> ... Either by completing the notion of char()
> being unicode and hence rewriting iolist() to contain byte() and binary(),
> or by reverting to char() being latin-1 char and using unicode:char()
> and unicode:string() where that is correct...

Please, by all means do the former. The latter will only cause havoc everywhere. For starters, I do not see any need in having two different basic types (byte() and char()) denoting (pretty much) the same thing. The only thing this does is cause unnecessary confusion to newcomers (and apparently to some old-timers too). Second, if you choose the latter you will eventually have to change lots of type inference code, because I promise you I will not do this, and believe me you don't want to go there... (The Vietnam jungle is probably a friendlier place ;) )

Cheers,
Kostis |
|
So just for my own understanding, and as it seems extremely important (strings are quite important these days!), as it stands now:

iolists can only (officially?) contain UTF-8? (As no UTF-8 code point will exceed 255, like Latin-1 / ASCII, and they are therefore all byte().)

Strings can be UTF-8, UTF-16 or UTF-32, but only the UTF-8 version is allowed in an iolist? (And therefore if you wanted an "iolist" (e.g. a non-flat list of chars) that contained UTF-16 or UTF-32 code points, you would have to stick exclusively to lists (strings) and not binaries, and use lists:flatten before you finished with it, to remove all the nested lists.) Binaries can be of any Unicode type.

Also, there does seem to be a needed distinction between char() and byte(), as they are not the same at all, but the documentation is wrong, as at the moment iolists can in fact only contain byte(), not char().

The suggested direction is to repair the docs so that they specify only allowing 0..255 ints (byte()) in iolists, rather than allowing iolists to contain any string as they did before the introduction of Unicode, in the days of Latin-1 etc.?

I think that goes against most people's (even Erlang implementers' :-) ) opinion of what an iolist is (that being a list of any valid string or binary), but maybe (to raise a totally different problem) it would prevent the possibility of an iolist having a mixed Unicode type and still being "valid" (even though I guess this is still possible, as binaries can in fact be other UTF representations). |
|
On 3/05/2011, at 7:43 AM, James Churchman wrote:

> strings can be of utf8 utf16 or utf32,

No. The model for strings is "one list element = one unicode character", and both UTF-8 and UTF-16 violate that.

A list of ASCII code-points is both a (Unicode) string and an iolist.

Of course, nothing stops you holding an abstract string as a list of octets using UTF-8 (or for that matter, UTF-EBCDIC) or as a list of 16-bit units using UTF-16. It's just that if you do so, what you have doesn't count as an Erlang string any more (outside ASCII).

> also there does seem to be a needed distinction between char() and byte() as they are not the same at all, but the documentation is wrong as at the moment iolists can infact only contain byte() not char()

Yes. |
|
On Mon, May 02, 2011 at 08:43:33PM +0100, James Churchman wrote:

> So just for my own understanding, and as it seems extremely important (
> strings are quite important these days!), as it stands now:
>
> iolists cant can only ( officially?) contain utf8? ( as no utf8 code point
> will exceed 255, like latin1 / asci, and are therefor are all byte() )

Richard O'Keefe explained this nicely; I'll just elaborate.

When iolists were introduced, they were for handling byte sequences: not having to copy, but just building them in nested form, either from individual bytes or from binaries. Back then characters were only Latin-1, hence matched bytes nicely and therefore iolists.

This is no longer true. Now you will have to do a translation from a sequence of characters into the corresponding byte sequence in an iolist. The preferred representation in Erlang is UTF-8, since it is the default for e.g. the unicode module and for the ~t modifier in the io module when printing strings.

> strings can be of utf8 utf16 or utf32, but only the utf8 version is allowed

The programmer should regard strings as a sequence of unicode code points. As such they are just that and there is no encoding to bother about. The code point number uniquely defines which unicode character it is.

> in an iolist? ( and therefore if you wanted an "iolist" ( eg a non flat list

UTF-8 is not the only encoding allowed in an iolist. You can use any encoding you desire. If you use the unicode module, the default format for encoding and decoding of binaries is utf8, but utf16 or utf32, big or little endian, is easy to do. An iolist is just a sequence of bytes.

> of chars) that contained utf 16 or 32 code points you would have to stick

An iolist is a non-flat list of bytes. Do not mix up bytes with characters.

> exclusively to lists ( strings) and not binaries and use lists:flatten

You cannot just use lists:flatten on a Unicode character string to get an iolist. The Unicode code points > 255 are still there. You will have to encode the Unicode characters into a suitable byte representation, e.g. using the unicode module.

> before you finished with it, to remove all the nested lists )
>
> binaries can be of any unicode type..

Binaries are sequences of bytes. Period. You decide what they mean.

> also there does seem to be a needed distinction between char() and byte() as
> they are not the same at all, but the documentation is wrong as at the
> moment iolists can infact only contain byte() not char()

Yes.

> the suggested direction is to repair the docs so that they specify only
> allowing 0~255 ints( byte() ) in iolists rather than allowing io-lists to
> contain any string as they did before the introduction of unicode / in the
> days of latin1 etc.. ?

Yes. That iolists could contain any string was by accident, since there were no characters > 255 in the days of Latin-1.

Since iolists are about sequences of bytes, they cannot be fixed into being allowed to contain any character. For that to be possible you would have to define the byte encoding for iolists, or store the byte encoding with a particular iolist.

Since there are so many byte encodings in use, it is better to make this visible to the programmer, so he/she is forced to understand the byte encoding problem and to handle it explicitly.

Therefore iolists are now, as they always were, sequences of bytes (8-bit). And that is all.

> i think that that goes agents most ( even erlang implementers :-) ) opinion
> of what an iolist is ( that being a list of any valid string or binary) but

I think not. An iolist is any valid byte or binary sequence. Binaries are sequences of bytes. They are all about bytes.

Characters and strings are today vastly more complex beasts than they were when US-ASCII and later ISO-LATIN-1 was the norm. This must be visible to the programmer.

> maybe ( to raise a totally different problem) would prevent the possibility
> of an iolist having a mixed unicode type and still begin "valid" ( even tho
> i guess this is still possible as binaries can in fact be other utf
> representations)

I repeat again. The programmer decides what the bytes mean. The list [0,0,16#21,16#2b] e.g. would mean "angstrom sign" if the encoding is UTF-32 big endian. And that is a valid iolist. But [16#212b] is not.

--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB |
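The point that "the programmer decides what the bytes mean" can be illustrated with the angstrom-sign example from the message above. This is a minimal sketch under stated assumptions: the module name `utf32_iolist` and the function `decode_utf32be/1` are hypothetical, and the caller is assumed to know that the bytes are UTF-32 big endian, since nothing in the iolist itself records the encoding.

```erlang
-module(utf32_iolist).
-export([decode_utf32be/1]).

%% Flatten an iolist into a binary, then interpret the bytes as
%% UTF-32 big endian. The encoding is the caller's assumption.
%% Returns a list of code points, or an {error, ...} or
%% {incomplete, ...} tuple if the bytes are not valid UTF-32BE.
decode_utf32be(IoList) ->
    Bin = erlang:iolist_to_binary(IoList),
    unicode:characters_to_list(Bin, {utf32, big}).
```

Both `decode_utf32be([0,0,16#21,16#2b])` and the nested form `decode_utf32be([[0,0], <<16#21,16#2b>>])` return `[16#212b]`. The reverse direction, from code points to bytes, is `unicode:characters_to_binary([16#212b], unicode, {utf32, big})`, which returns `<<0,0,33,43>>`.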
|
On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:

> The programmer should regard strings as a sequence of unicode code points.
> As such they are just that and there is no encoding to bother about.
> The code point number uniquely defines which unicode character it is.

As I recall, a Unicode character can be composed of up to 7 code points. To quote a text book I'm looking at now:

-------------
The trick is, again, to disabuse yourself of the idea that a one-to-one correspondence exists between "characters" as the user is used to thinking of them and code points (or code units) in the backing store. Unicode uses the term "character" to mean more or less "the entity that's represented by a single Unicode code point," but this concept doesn't always match the user's definition of "character".
-------------

I think a more complete design would represent a character as a binary that is a UTF8 encoding of its code points. A string would then be a deep list of these binaries.

--
Anthony Shipman                    Mamas don't let your babies
[hidden email]                     grow up to be outsourced. |
|
On Tuesday, May 03, 2011, Raimo Niskanen wrote:

> I repeat again. The programmer decides what the bytes mean. The list
> [0,0,16#21,16#2b] e.g would mean "angstrom sign" if the encoding is
> UTF-32 big endian. And that is a valid iolist.
> But [16#212b] is not.

Out of curiosity, why does

    unicode:characters_to_binary([16#212b], {utf32, big}).

return the UTF-8 representation of Å (Angstrom sign) and not the big-endian UTF-32 like I expected? |
|
On Tuesday, May 03, 2011, Anthony Shipman wrote:

> I think a more complete design would represent a character as a binary
> that is a UTF8 encoding of its code points. A string would then be a
> deep list of these binaries.

How is that superior to representing a character by a single integer representing the Unicode code point, and a string by a list of characters? You can always use unicode:characters_to_binary/1 to convert to a UTF-8 binary if you wish. |
|
On Wed, May 04, 2011 at 05:33:58AM +1000, Anthony Shipman wrote:

> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
> > The programmer should regard strings as a sequence of unicode code points.
> > As such they are just that and there is no encoding to bother about.
> > The code point number uniquely defines which unicode character it is.
>
> As I recall, a Unicode character can be composed of up to 7 code points.
> To quote a text book I'm looking at now:
> -------------
> The trick is, again, to disabuse yourself of the idea that a one-to-one
> correspondence exists between "characters" as the user is used to thinking of
> them and code points (or code units) in the backing store. Unicode uses the
> term "character" to mean more or less "the entity that's represented by a
> single Unicode code point," but this concept doesn't always match the user's
> definition of "character".
> -------------

There seems to be a terminology clash here that I will remember for the future. When I talked about "Unicode code points" I meant the character number in the Unicode system. I did not think it was allowed to talk about "code points" when talking about byte-encoded data.

There are text books that talk about "code points (or code units) in the backing store". I find that very confusing. I will always call it "byte encoding" or something like that.

> I think a more complete design would represent a character as a binary that is
> a UTF8 encoding of its code points. A string would then be a deep list of
> these binaries.

--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB |
|
On Tue, May 03, 2011 at 03:05:24PM -0500, David Mercer wrote:

> On Tuesday, May 03, 2011, Raimo Niskanen wrote:
>
> > I repeat again. The programmer decides what the bytes mean. The list
> > [0,0,16#21,16#2b] e.g would mean "angstrom sign" if the encoding is
> > UTF-32 big endian. And that is a valid iolist.
> > But [16#212b] is not.
>
> Out of curiosity, why does
>
> unicode:characters_to_binary([16#212b], {utf32, big}).

Man page says:

    characters_to_binary(Data, InEncoding) -> binary() | ...

You changed the InEncoding to {utf32,big}. You want this:

    characters_to_binary(Data, InEncoding, OutEncoding) -> binary() | ...

    1> unicode:characters_to_binary([16#212b], unicode, {utf32, big}).
    <<0,0,33,43>>

> return the UTF-8 representation of Å (Angstrom sign) and not the big-endian
> UTF-32 like I expected?

InEncoding only applies to binaries in the indata, since integers are just Unicode code points and have no encoding:

    2> unicode:characters_to_binary([16#212b,<<226,132,171>>], utf8, {utf32, big}).
    <<0,0,33,43,0,0,33,43>>

Note: unicode is an alias for utf8 in the unicode module, since utf8 is the default encoding.

It is all in the Erlang man page for unicode(3).

--
/ Raimo Niskanen, Erlang/OTP, Ericsson AB |
|
On 2011-05-04, at 09:57 , Raimo Niskanen wrote:

> On Wed, May 04, 2011 at 05:33:58AM +1000, Anthony Shipman wrote:
>> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>>> The programmer should regard strings as a sequence of unicode code points.
>>> As such they are just that and there is no encoding to bother about.
>>> The code point number uniquely defines which unicode character it is.
>>
>> As I recall, a Unicode character can be composed of up to 7 code points.
>> To quote a text book I'm looking at now:
>> -------------
>> The trick is, again, to disabuse yourself of the idea that a one-to-one
>> correspondence exists between "characters" as the user is used to thinking of
>> them and code points (or code units) in the backing store. Unicode uses the
>> term "character" to mean more or less "the entity that's represented by a
>> single Unicode code point," but this concept doesn't always match the user's
>> definition of "character".
>> -------------
>
> There seems to be a terminology here clash that I will remember for the future.
> When I talked about "Unicode code points" I ment the character number
> in the Unicode system. I did not think it was allowed to talk about "code points"
> when talking about byte encoded data.

[...] code points as themselves. So many people make the shortcut (furthermore, most people aren't really interested in understanding Unicode, and I can understand that, it's a drag, so they mix Unicode lingo with "normal" speech, leading to less-than-sensical results).

I believe the issue Anthony mentions here is the difference between glyphs and code points (combining marks) rather than the difference between code points and on-disk bytes (resulting from Unicode encoding): a "visible character" (e.g. į̇́) can be composed of multiple code points, one "base" code point and a number of combining-mark code points (diacritics being the main offender). (NB: the glyph "į̇́" is, in fact, composed of three code points: U+012F, U+0307 and U+0301.)

What most users think of as a character is what Unicode calls a glyph: it's the graphical representation of a group of combined code points (that group may be unary). Whereas in Unicode, a character is the graphical representation of a single code point. As a result, a "user" character may be composed of a number of "Unicode" characters. |
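The three-code-point example above can be counted both ways in Erlang. The sketch below is an illustration only and assumes Erlang/OTP 20 or later, where `string:length/1` counts grapheme clusters (the Unicode term for a base code point plus its combining marks); the module name `cluster_count` and the function `counts/1` are hypothetical.

```erlang
-module(cluster_count).
-export([counts/1]).

%% Returns {NumberOfCodePoints, NumberOfGraphemeClusters} for a
%% string given as a flat list of Unicode code points.
%% Assumes OTP 20+, where string:length/1 is grapheme-cluster aware.
-spec counts([char()]) -> {non_neg_integer(), non_neg_integer()}.
counts(Str) ->
    {erlang:length(Str), string:length(Str)}.
```

For the sequence quoted in the message, `counts([16#012F, 16#0307, 16#0301])` should give `{3, 1}`: three list elements (code points), but a single user-perceived character.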
|
On Wed, 4 May 2011 06:05:24 am David Mercer wrote:

> On Tuesday, May 03, 2011, Anthony Shipman wrote:
> > I think a more complete design would represent a character as a binary
> > that is a UTF8 encoding of its code points. A string would then be a
> > deep list of these binaries.
>
> How is that superior than representing a character by a single integer
> representing the Unicode codepoint, a string by a list of characters? You
> can always use unicode:characters_to_binary/1 to convert to a UTF-8 binary
> if you wish.

What we think of as a character, e.g. some letter on a page, can be a combination of a base component and some combining components. (I use the word component since I'm not quite sure at the moment exactly what a glyph means. A component is represented by a code point.) Combining components include accents and a variety of other marks that some languages attach to the base component.

For example, in French the "e-acute" could be represented as a single code point or as a pair of the code points for "e" and "acute accent". The standard puts some effort into defining a canonical representation so that it isn't a total nightmare to tell if two characters are the same. You have to convert a Unicode string to its canonical form before you can test for equality.

To fully implement the intent of Unicode we need to talk in terms of characters, i.e. something you may insert or delete in a word processor, which may themselves be a sequence of code points which are kept together.

--
Anthony Shipman                    Mamas don't let your babies
[hidden email]                     grow up to be outsourced. |
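The "convert to canonical form before testing for equality" step described above can be sketched as follows. This assumes Erlang/OTP 20 or later, which provides `unicode:characters_to_nfc_list/1` (a later addition than this thread); the module name `canon_eq` and the function `canonically_equal/2` are hypothetical names for illustration.

```erlang
-module(canon_eq).
-export([canonically_equal/2]).

%% Compare two strings after normalizing both to NFC
%% (canonical composition). Assumes OTP 20+.
-spec canonically_equal(unicode:chardata(), unicode:chardata()) -> boolean().
canonically_equal(A, B) ->
    unicode:characters_to_nfc_list(A) =:= unicode:characters_to_nfc_list(B).
```

With the e-acute example from the message, the plain comparison `[16#E9] =:= [$e, 16#0301]` is `false`, while `canonically_equal([16#E9], [$e, 16#0301])` is `true`, because both normalize to the single code point U+00E9.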
|
In reply to this post by Anthony Shipman
On 4/05/2011, at 7:33 AM, Anthony Shipman wrote:

> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>> The programmer should regard strings as a sequence of unicode code points.
>> As such they are just that and there is no encoding to bother about.
>> The code point number uniquely defines which unicode character it is.
>
> As I recall, a Unicode character can be composed of up to 7 code points.

I find this rather confusing.
Here are some official definitions from the Unicode standard:

  Character.
  (1) The smallest component of written language that has semantic value;
      refers to the abstract meaning and/or shape, rather than a specific
      shape (see also glyph), though in code tables some form of visual
      representation is essential for the reader’s understanding.
  (2) Synonym for abstract character.
  (3) The basic unit of encoding for the Unicode character encoding.
  (4) The English name for the ideographic written elements of Chinese origin.
      [See ideograph (2).]

  Coded Character Set.
  A character set in which each character is assigned a numeric code point.
  Frequently abbreviated as character set, charset, or code set;
  the acronym CCS is also used.

  Code Point.
  (1) Any value in the Unicode codespace; that is, the range of integers
      from 0 to 10FFFF(base 16).
      (See definition D10 in Section 3.4, Characters and Encoding.)
  (2) A value, or position, for a character, in any coded character set.

  Code Unit.
  The minimal bit combination that can represent a unit of encoded text
  for processing or interchange. The Unicode Standard uses 8-bit code units
  in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form,
  and 32-bit code units in the UTF-32 encoding form.
  (See definition D77 in Section 3.9, Unicode Encoding Forms.)

Each Unicode character has *BY DEFINITION* precisely *ONE* code point.
A code point is a number in the range 0 to 1,114,111.

The largest legal Unicode code point (hex 10FFFF) requires precisely
FOUR code units:

    11110100 10001111 10111111 10111111
    -----  3 --     6 --     6 --     6

The 11110 prefix on the leading byte says "here are four bytes";
the "10" prefixes on the remaining bytes say "here are 6 more bits".
No Unicode code point requires more than four code units.

> To quote a text book I'm looking at now:
> -------------
> The trick is, again, to disabuse yourself of the idea that a one-to-one
> correspondence exists between "characters" as the user is used to thinking of
> them and code points (or code units) in the backing store.

Sorry, it looks as though you need a better text book.
Code points and code units are NOT the same thing (at least for UTF-8 and
UTF-16). There IS, by definition, a direct correspondence between Unicode
characters and code points (not every code point has been assigned a
character yet).

> Unicode uses the
> term "character" to mean more or less "the entity that's represented by a
> single Unicode code point," but this concept doesn't always match the user's
> definition of "character".

And _that_ is talking about two other issues:

(1) Unicode classifies code points as Graphic, Format, Control, Private-Use,
    Surrogate, Noncharacter, or Reserved. Only the Graphic characters are
    ones that users are likely to think of as characters.
(2) Things that the user thinks of as a character (like é) may be
    represented by sequences of code points, called Grapheme Clusters,
    consisting of a base character and some nonspacing marks. This has
    nothing to do with encodings.

> I think a more complete design would represent a character as a binary that is
> a UTF8 encoding of its code points. A string would then be a deep list of
> these binaries.

Once again, a Unicode character has *by definition* one code point;
and from a storage point of view, it's pretty silly to use a big thing
like a binary to represent a 21-bit integer.

The main principle to understand about Unicode is to *always* think in
terms of strings, not of characters.
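A minimal sketch of the code point versus code unit distinction made in the message above. It is written in Python purely for illustration (the thread itself contains no code), and the helper names `utf8_bits` and `code_unit_counts` are hypothetical. It shows that the bit pattern given for U+10FFFF is its UTF-8 encoding form, and that the number of code units depends on the encoding form chosen:

```python
def utf8_bits(code_point):
    """Render the UTF-8 encoding of one code point as space-separated bytes in binary."""
    return " ".join(format(byte, "08b") for byte in chr(code_point).encode("utf-8"))


def code_unit_counts(code_point):
    """Count the code units one code point needs in each Unicode encoding form."""
    ch = chr(code_point)
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


if __name__ == "__main__":
    print(utf8_bits(0x10FFFF))         # bit layout of the largest code point
    print(code_unit_counts(0x10FFFF))  # one code point, different unit counts
```

The "four code units" figure is therefore specific to UTF-8; the same single code point takes two code units in UTF-16 and one in UTF-32.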
|
On 2011-05-05, at 02:03, Richard O'Keefe wrote:

> On 4/05/2011, at 7:33 AM, Anthony Shipman wrote:
>
>> On Tue, 3 May 2011 07:45:49 pm Raimo Niskanen wrote:
>>> The programmer should regard strings as a sequence of unicode code points.
>>> As such they are just that and there is no encoding to bother about.
>>> The code point number uniquely defines which unicode character it is.
>>
>> As I recall, a Unicode character can be composed of up to 7 code points.
>
> I find this rather confusing.
> Here are some official definitions from the Unicode standard:
>
> Character.
> (1) The smallest component of written language that has semantic value;
> refers to the abstract meaning and/or shape, rather than a specific
> shape (see also glyph), though in code tables some form of visual
> representation is essential for the reader’s understanding.
> (2) Synonym for abstract character.
> (3) The basic unit of encoding for the Unicode character encoding.
> (4) The English name for the ideographic written elements of Chinese origin.
> [See ideograph (2).]
>
> Coded Character Set.
> A character set in which each character is assigned a numeric code point.
> Frequently abbreviated as character set, charset, or code set;
> the acronym CCS is also used.
>
> Code Point.
> (1) Any value in the Unicode codespace; that is, the range of integers
> from 0 to 10FFFF(base 16).
> (See definition D10 in Section 3.4, Characters and Encoding.)
> (2) A value, or position, for a character, in any coded character set.
>
> Code Unit.
> The minimal bit combination that can represent a unit of encoded text
> for processing or interchange. The Unicode Standard uses 8-bit code units
> in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form,
> and 32-bit code units in the UTF-32 encoding form.
> (See definition D77 in Section 3.9, Unicode Encoding Forms.)
>
> Each Unicode character has *BY DEFINITION* precisely *ONE* code point.
> A code point is a number in the range 0 to 1,114,111.
>
> The largest legal Unicode code point (hex 10FFFF) requires precisely
> FOUR code units:
>
> 11110100 10001111 10111111 10111111
> -----  3 --     6 --     6 --     6
>
> The 11110 prefix on the leading byte says "here are four bytes";
> the "10" prefixes on the remaining bytes say "here are 6 more bits".

though, which may trip people up.

> No Unicode code point requires more than four code units.

For now.

> And _that_ is talking about two other issues:

I strongly disagree. I believe this is the *core* of the whole issue, and
*this* is the reason why people are confused: a complete mastery of the
Unicode lingo (which the standard's definitions do not even provide: as you
mentioned in your comment, "Character" has 4 different definitions, most of
which cannot be understood by the layman) and a very good capacity to
differentiate common speech from Unicode lingo are necessary to navigate
Unicode discussions correctly.

The vast majority of developers do *not* possess these (not necessarily for
lack of trying), and the differences in the status of (mostly) the word
"character" (which can be hard to understand from context) lead to a
minefield of misunderstanding and frustration.

I strongly believe it was a mistake for the Unicode consortium to use this
word.
|
AFAIK, the confusion comes from two different uses of the term "character".

The "individual character" is at the heart of Unicode. Each individual
character maps to a unique code point. For instance, a lowercase alpha is
the character named "GREEK SMALL LETTER ALPHA" and maps to code point
U+03B1. The Unicode code points are between 0 and 0x10FFFF.

The "logical character" is what human beings usually have in mind. In the
real world, a text is a sequence of logical characters. An example of such
a character is the lowercase letter "e" with an acute accent.

Some logical characters do not map directly to individual characters and
must be represented as a combination of several individual characters (this
is called I think an "extended grapheme cluster").

Some logical characters do map to individual characters and can therefore
have two different representations in Unicode:
 - with an individual character
 - with a combination of several individual characters

For instance, our "e" with an acute accent can be represented as:
 - the individual character "LATIN SMALL LETTER E WITH ACUTE" (U+00E9)
or
 - the combination "LATIN SMALL LETTER E" (U+0065) plus "COMBINING ACUTE
   ACCENT" (U+0301)

To cope with this, Unicode defines the notions of canonical and compatible
equivalence (see http://en.wikipedia.org/wiki/Unicode_equivalence).

To come back to the point, we have to define what we mean with the Erlang
char() type:
 - if it's an individual character then it can naturally be represented as
   a single integer for its code point
 - if it's a logical character then it has to be a list of integers

In any case, the language must provide specific functions to work on strings
and characters. For instance, a logical character comparison must take into
account the Unicode equivalence.

Cheers,
__________________________________________________________
Lionel Cons        http://cern.ch/lionel.cons
CERN               http://cern.ch
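The two representations of "e" with an acute accent described above can be checked with a short sketch. It is written in Python for illustration only and relies on the standard `unicodedata` module; the helper name `canonically_equal` is hypothetical, and normalizing both sides to NFC is just one way to test canonical equivalence, not something prescribed in the thread:

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT, two code points


def canonically_equal(a, b):
    """Compare two strings under canonical equivalence by normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(len(precomposed), len(combining))            # code point counts differ
    print(precomposed == combining)                    # plain comparison sees two different strings
    print(canonically_equal(precomposed, combining))   # equivalence-aware comparison
```

A comparison that works code point by code point treats the two spellings as different, which is why a "logical character" comparison needs a normalization step of this kind.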
|
On Thu, May 05, 2011 at 09:16:39AM +0200, Lionel Cons wrote:

> AFAIK, the confusion comes from two different uses of the term "character".
>
> The "individual character" is at the heart of Unicode. Each individual
> character maps to a unique code point. For instance, a lowercase alpha is
> the character named "GREEK SMALL LETTER ALPHA" and maps to code point
> U+03B1. The Unicode code points are between 0 and 0x10FFFF.
>
> The "logical character" is what human beings usually have in mind. In the
> real world, a text is a sequence of logical characters. An example of such
> a character is the lowercase letter "e" with an acute accent.
>
> Some logical characters do not map directly to individual characters and
> must be represented as a combination of several individual characters (this
> is called I think an "extended grapheme cluster").
>
> Some logical characters do map to individual characters and can therefore
> have two different representations in Unicode:
> - with an individual character
> - with a combination of several individual characters
>
> For instance, our "e" with an acute accent can be represented as:
> - the individual character "LATIN SMALL LETTER E WITH ACUTE" (U+00E9)
> or
> - the combination "LATIN SMALL LETTER E" (U+0065) plus "COMBINING ACUTE
> ACCENT" (U+0301)
>
> To cope with this, Unicode defines the notions of canonical and compatible
> equivalence (see http://en.wikipedia.org/wiki/Unicode_equivalence).
>
> To come back to the point, we have to define what we mean with the Erlang
> char() type:
> - if it's an individual character then it can naturally be represented as
> a single integer for its code point
> - if it's a logical character then it has to be a list of integers

According to your excellent clarification, the Erlang char() type today
must then be defined as a Unicode individual character, in the range 0 up to
0x10FFFF (there are invalid values, right?).

>
> In any case, the language must provide specific functions to work on strings
> and characters. For instance, a logical character comparison must take into
> account the Unicode equivalence.

That is, as far as I know, unimplemented functionality. Some of it may fit
into the unicode module, and some might be left to a text processing
application to implement. Just implementing Unicode equivalence sounds
complicated and like a moving target, or something best implemented by OS
libraries.

>
> Cheers,
> __________________________________________________________
> Lionel Cons        http://cern.ch/lionel.cons
> CERN               http://cern.ch

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB
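On the question of invalid values inside the 0 to 0x10FFFF range, here is a rough sketch in Python (for illustration only; the function names are hypothetical, and nothing here describes how Erlang/OTP itself checks char() values). It assumes the usual Unicode distinction between a code point (any integer in the codespace) and a scalar value (a code point outside the surrogate range D800 to DFFF, which is reserved for UTF-16):

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_code_point(n):
    """True if n lies in the Unicode codespace 0..0x10FFFF."""
    return isinstance(n, int) and 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n):
    """True if n is a code point that is not a surrogate, so it can be encoded on its own."""
    return is_code_point(n) and not (SURROGATE_FIRST <= n <= SURROGATE_LAST)


if __name__ == "__main__":
    for n in (0x41, 0xE9, 0xD800, 0x10FFFF, 0x110000):
        print(hex(n), is_code_point(n), is_scalar_value(n))
```

Under this assumption, 0xD800 is inside the integer range of char() yet cannot stand for a character on its own, which is one concrete sense in which the range contains invalid values.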
