>> << An Erlang "string" is simply a list of integers. Each integer can
>> represent any Unicode codepoint/character. >>
>
> Except that Unicode codepoints represent characters, right?
It depends: the definition of "character" is quite ambiguous.
By "character", many people mean what Unicode calls a "grapheme" (what a
user perceives as one unit of writing; the concrete shape displayed on a
medium is a "glyph"[-1]). The meaning of the word may also change across
cultures: for diacritics, for instance, some cultures consider
base+diacritic(s) to be a single character, others several. And it
becomes very tough to define for e.g. hangul: is the character a hangul
block or the jamo composing it?[0]
The Unicode Standard itself lists 4 different and potentially
incompatible meanings for the word:

    (1) The smallest component of written language that has semantic
        value; refers to the abstract meaning and/or shape, rather than
        a specific shape (see also glyph), though in code tables some
        form of visual representation is essential for the reader’s
        understanding.
    (2) Synonym for abstract character.
    (3) The basic unit of encoding for the Unicode character encoding.
    (4) The English name for the ideographic written elements of Chinese
        origin.

where "abstract character" is defined as:

    A unit of information used for the organization, control, or
    representation of textual data.
    * When representing data, the nature of that data is generally
      symbolic as opposed to some other kind of data (for example, aural
      or visual). Examples of such symbolic data include letters,
      ideographs, digits, punctuation, technical symbols, and dingbats.
    * An abstract character has no concrete form and should not be
      confused with a glyph.
    * An abstract character does not necessarily correspond to what a
      user thinks of as a “character” and should not be confused with a
      grapheme.
    * The abstract characters encoded by the Unicode Standard are known
      as Unicode abstract characters.
    * Abstract characters not directly encoded by the Unicode Standard
      can often be represented by the use of combining character
      sequences.

In most meanings of the word "character", a character maps to a
*sequence* of Unicode code points (possibly of length one); there is no
1:1 mapping between characters and code points.
[-1] don't take my word for it, I might have fucked up my recollection,
I regularly get confused between the precise meanings of "glyph"
and "grapheme"
[0] Modern hangul is written as syllabic blocks; each block is composed
of three jamo (letters; technically 2 to 5 for ancient/historical
texts). For instance 한 (han) is a block composed of the three jamo ㅎ,
ㅏ, and ㄴ, and Unicode allows encoding it either as HANGUL SYLLABLE
HAN or as the sequence HANGUL CHOSEONG HIEUH, HANGUL JUNGSEONG A and
HANGUL JONGSEONG NIEUN. But is it a character or not?
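
To make the two encodings of 한 concrete, here is a small sketch in
Python (used only as a neutral illustration, not Erlang; a Python str is
also a sequence of code points). It assumes the standard unicodedata
module, and it relies on canonical normalization (NFD/NFC) to move
between the precomposed syllable and the jamo sequence. The helper
name names() is made up for this example.

```python
import unicodedata

precomposed = "\uD55C"            # one code point: the syllable block
jamo = "\u1112\u1161\u11AB"       # three code points: conjoining jamo

def names(s):
    """Return the Unicode character name of every code point in s."""
    return [unicodedata.name(c) for c in s]

# NFD splits the block into its jamo, NFC puts them back together.
decomposed = unicodedata.normalize("NFD", precomposed)
recomposed = unicodedata.normalize("NFC", jamo)

print(names(precomposed))
print(names(jamo))
```

Both spellings display as the same block, yet as lists of integers they
have length 1 and length 3, which is exactly why "is it a character?"
has no single answer.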
No, you're confusing Unicode (a sequence of code points) with specific
encodings such as UTF-8 and UTF-16. Unicode code points are backwards
compatible with Latin-1: the values from 128 to 255 are the same. In the
UTF-8 byte encoding they're not. At runtime, Erlang's strings are just
plain sequences of Unicode code points (you can think of it as UTF-32 if
you like). Whether the source code is encoded in UTF-8 or Latin-1 or any
other encoding is irrelevant as long as the compiler knows how to
transform the input to the single-codepoint representation.
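
As a rough illustration of "think of it as UTF-32", here is a sketch in
Python (a stand-in for the Erlang list, not Erlang itself). The helper
names code_points() and utf32_units() are invented for this example; it
assumes only the standard struct module. It also checks the Latin-1
claim: decoding byte N as Latin-1 gives code point N.

```python
import struct

def code_points(s):
    """Return s as a list of integers, one per Unicode code point."""
    return [ord(c) for c in s]

def utf32_units(s):
    """Encode s as big-endian UTF-32 and return its 32-bit units."""
    data = s.encode("utf-32-be")   # explicit endianness, so no BOM is added
    return list(struct.unpack(">%dI" % (len(data) // 4), data))

s = "a\u221Eb"                     # "a∞b"

# Every byte value 0..255 read as Latin-1 yields the same code point value.
latin1_all = bytes(range(256)).decode("latin-1")

print(code_points(s))
print(utf32_units(s))
```

The two printed lists are identical: each UTF-32 unit is simply the code
point value, which is all the runtime representation holds.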
For example, reversing a Unicode string is a bad idea anyway because it
could contain combining characters, and reversing the order of the
codepoints in that case will create an illegal string. But an expression
like lists:reverse("a∞b") will be working on the list [97, 8734, 98]
(once the compiler has been extended to accept other encodings than
Latin-1), not the list [97,226,136,158,98], so it will produce the
intended "b∞a". This string might then become encoded as UTF-8 on its
way to your terminal, but that's another story.
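
The difference between the two lists can be sketched like this, in
Python rather than Erlang (only to have something easy to run; the
variable and function names are made up for the example). It reverses
"a∞b" once as code points and once as UTF-8 byte values.

```python
text = "a\u221Eb"                          # "a∞b"

points = [ord(c) for c in text]            # the single-codepoint list
utf8_values = list(text.encode("utf-8"))   # the byte-per-element list

# Reversing code points gives a well-formed string again.
reversed_text = "".join(chr(cp) for cp in reversed(points))

def decodes_as_utf8(values):
    """Return True if the integer list is a well-formed UTF-8 byte sequence."""
    try:
        bytes(values).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(points, utf8_values, reversed_text)
print(decodes_as_utf8(list(reversed(utf8_values))))
```

Reversing the byte list puts a continuation byte first, so the result is
no longer decodable as UTF-8.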
/Richard
Can you go further and say that it actually *is* UTF-32? A footnote
like "[*] Basically, UTF-32; see ref XYZ for details" might be
helpful.
-michael turner
The tricky thing is that if I enter a string containing " ́e" in my module and later reverse it, I will get "é" and not "e ́" as the final result. What was initially [16#0301,16#0065] gets reversed into [16#0065,16#0301], which is not the same as the correct visual representation "e ́" (represented as [16#0065, $ , 16#0301], with an implicit space in there).
/Richard
I would expect, as a user of some string data type or bytestring that
claims to support unicode, that reversing a string with the characters "
́e" would give me "e ́". Single code point representation or not.
The concept of cluster has to be understood for it to make sense.
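
A minimal sketch of that surprise, in Python instead of Erlang (the
behaviour of a plain code point reversal is the same in both). It
assumes the standard unicodedata module; NFC normalization is used here
only to make visible that the reversed sequence is an accented e.

```python
import unicodedata

original = "\u0301e"        # COMBINING ACUTE ACCENT, then "e"
naive = original[::-1]      # plain reversal of the code points

# After the reversal the accent follows "e" and therefore combines with it.
composed = unicodedata.normalize("NFC", naive)

print([hex(ord(c)) for c in original])
print([hex(ord(c)) for c in naive])
print([hex(ord(c)) for c in composed])
```

The reversed list is [16#0065, 16#0301], and normalizing it collapses to
the single code point 16#00E9.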
Regarding your latest post (I received it while writing this one).
Cursed be the problem of multiple environments. This is never going to
be easy to figure out!
It doesn't. That's what I said: "reversing a Unicode string is a bad
idea anyway because it could contain combining characters". But I also
clarified that on the Erlang level, at runtime, strings will contain
single code points rather than a UTF-8 encoded byte sequence, so for the
particular example of "a∞b" it happens to work. Nothing more, nothing less.
> I would expect, as a user of some string data type or bytestring that
> claims to support unicode, that reversing a string with the characters "
> ́e" would give me "e ́". Single code point representation or not.
Yes. That's why there needs to be a new Unicode-aware string library.
Operating directly on lists (e.g. using lists:reverse/1, or even
length/1) is always going to have surprising effects, and the old
'string' module in stdlib probably can't be modernized while maintaining
backwards compatibility.
> The concept of cluster has to be understood for it to make sense.
Grapheme clusters are actually one of the things you don't need to think
too much about unless you're writing an editor or similar where you need
to figure out between which code points to move the cursor or select a
sequence of code points based on what the user points to on the screen.
Combining characters are a much more basic thing and need to be
understood by pretty much anyone working with Unicode.
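
For what it is worth, a reversal that keeps combining characters
attached to their base could look roughly like the sketch below. It is
Python, not Erlang, and the function names clusters() and
reverse_keeping_marks() are hypothetical. It is a deliberate
simplification: it only looks at the canonical combining class from the
unicodedata module, which is much less than the full grapheme cluster
rules.

```python
import unicodedata

def clusters(s):
    """Split s into groups: one base code point plus trailing combining marks."""
    groups = []
    for ch in s:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch      # attach the mark to the preceding group
        else:
            groups.append(ch)     # start a new group
    return groups

def reverse_keeping_marks(s):
    """Reverse the order of the groups, not of the individual code points."""
    return "".join(reversed(clusters(s)))

sample = "ae\u0301b"              # "a", "e" + combining acute, "b"
print(clusters(sample))
print(len(sample), len(clusters(sample)))
```

The same grouping also explains why a plain length count is surprising:
the sample has four code points but three groups.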
On 07/31/2012 01:48 PM, Ian wrote:
<< A "string" is a list of integers where the integers
represent Unicode codepoints. >>
I think this is technically correct, but it is very confusing because it
implies that the source file may be encoded as unicode.
As I understand it, source files are always treated as being in Latin-1.
This means that string literals are lists of Latin-1 values, and not
lists of unicode codepoints. (The values from 128 to 255 have
different/no meanings, and values > 255 will not happen).
On 08/01/2012 12:52 AM, CGS wrote:
> Actually, try this:
> 1. set your environment to UTF-8 (in my case, whatever Linux terminal
> with BASH environment, export LANG="en_US.utf8", use locale to find your
> environment language definition - "en_US.latin1" for LATIN-1)
> 2. in a module:
> test_reverse(String) -> lists:reverse(String).
> 3. Give as parameter the example given by yourself.
> 4. Check the output.
Ah, but when you say "give as parameter" you mean "pass it a string literal from the shell", right? I never said anything about strings in the shell - that's a different environment from source files, and as you described, the shell nowadays detects your locale and translates UTF-8 console input into a string literal containing Unicode code points.
This is exactly how it would happen in source code as well, if the compiler only knew how to detect that a source file is in a different encoding from Latin-1. So the compiler is really the main thing that needs to be fixed, and then there should be no surprises on the encoding level anymore.
> There is no byte sequence valid in UTF-8 that is not also
> valid in Latin-1.
This is incorrect.
Latin-1 code points are a subset of Unicode codepoints.
Codepoints are not bytes. Codepoints are indexes in character tables. Latin-1 is a table of at most 256 characters, whereas Unicode is at this point a table of more than 100,000 characters. The codepoints in the range 127-159 carry no printable characters: Unicode assigns them to control codes, and ISO 8859-1 itself leaves them undefined.
When it comes to the binary representation of these codepoints, Latin-1 is encoded as literal bytes because all of its codepoints are less than 256. Unicode codepoints, on the other hand, can be larger than 255, so in order to represent them as bytes they need to be encoded.
Latin-1 bytes larger than 127 are not the same character in UTF-8, because UTF-8 uses the 8th bit to build multi-byte sequences that represent Unicode codepoints larger than 127. So while values greater than 127 in a list are valid Latin-1, if those values represent UTF-8 bytes, the characters are not the same.
For instance, 233 is the codepoint for an accented e in Latin-1 and Unicode, the binary representation of that character in Latin-1 is literally the byte <<233>> but when the codepoint is encoded as UTF-8, it is the bytes <<195,169>>.
The list [195,169] is never going to be an accented e in Erlang because as far as Erlang is concerned, that is a list of Latin-1 codepoints which are the characters Ã and ©. Ever see CafÃ© on a webpage? That is because they told the browser that their HTML was Latin-1 when it was actually UTF-8.
It just so happens that [195,169] is also of type chardata() because all valid Latin-1 codepoints are also valid Unicode codepoints. In either case, [195,169] is not an accented e. At the very least it is a list of integers whose values represent UTF-8 encoded bytes, but until you convert those UTF-8 bytes to Unicode codepoints it'll never be chardata() with the correct characters.
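
The 233 versus [195,169] arithmetic can be checked with a few lines of
Python (used here as a neutral calculator, not as Erlang; the variable
names are made up). The last step reproduces the webpage effect by
reading UTF-8 bytes as if each byte were a Latin-1 character.

```python
e_acute = chr(233)                                # the codepoint 233

latin1_values = list(e_acute.encode("latin-1"))   # one byte, the codepoint itself
utf8_values = list(e_acute.encode("utf-8"))       # two bytes

# UTF-8 bytes misread as Latin-1: one character per byte.
misread = "Caf\u00E9".encode("utf-8").decode("latin-1")

print(latin1_values, utf8_values)
print([ord(c) for c in misread])
```

The misread word ends in the codepoints 195 and 169, which is the same
list discussed above.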
To summarize: Unicode is a table of codepoints. A codepoint is an index in the table. UTF-8 is a codec for turning codepoints to and from bytes. UTF-8 cannot be used to refer to what Erlang calls chardata(). chardata() is a list of integer() whose value is a valid Unicode codepoint. UTF-8 can only refer to a sequence of bytes.
Eric.
So currently if you encode your source code as UTF-8, string literals become the literal byte sequences. This is different than in the shell where string literals get automatically turned into their Unicode codepoints.
It appears that the solution is a compiler flag that tells the compiler that string literals should be decoded as UTF-8. So when the compiler reads the byte sequence 16#C3A9 it knows that it should be a 233 in the list because 16#C3A9 is the UTF-8 encoded sequence for the codepoint 233.
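
What such a flag would have to do to each literal can be sketched as
follows. This is Python, not compiler code, and the function name
utf8_values_to_code_points() is hypothetical: it takes the list of byte
values that a Latin-1 reading of the source produces and decodes it as
UTF-8 into codepoints.

```python
def utf8_values_to_code_points(values):
    """Treat a list of integers 0..255 as UTF-8 bytes and return codepoints.

    Raises ValueError if a value is not in the byte range and
    UnicodeDecodeError if the bytes are not well-formed UTF-8.
    """
    return [ord(c) for c in bytes(values).decode("utf-8")]

print(utf8_values_to_code_points([195, 169]))
print(utf8_values_to_code_points([97, 226, 136, 158, 98]))
```

Note that the conversion is not safe to apply blindly: a list that
really is Latin-1, such as [233], is not valid UTF-8 and is rejected.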
I don't know what the overall support for chardata() and charlist() is in the standard library, so doing this may cause many headaches when someone tries to stuff a charlist() where an iolist() goes, or chardata() where a string() goes. This may introduce subtle bugs that only occur when non-Latin-1 characters are used.
Eric.
Sorry. I took your statement out of context.
I'm working on a 2nd edition of my book, and have got to strings :-)
Strings confuse everybody, including me, so I have a few questions:
To start with, Erlang doesn't have strings - it has lists (not strings)
and it has string literals.
I want to define a string - is this correct:
<< A "string" is a list of integers where the integers
represent Unicode codepoints. >>
Questions:
Is the sentence inside << .. >> using the correct terminology?
If not what should it say?
Is the sentence inside << ... >> widely understood, or do you think it
would confuse a lot of people?
Is the phrase "string literal" widely understood?
Cheers
/Joe
A "string" is an arbitrary representation of some original binary stream that has been stripped of any information about its encoding format - whether this was a direct transformation or not. No wonder nobody can agree.