[erlang-questions] correct terminology for referring to strings


Joe Armstrong

Jul 31, 2012, 5:24:05 AM
to Erlang
I'm working on a 2nd edition of my book, and have got to strings :-)
Strings confuse everybody, including me so I have a few questions:

To start with, Erlang doesn't have strings - it has lists (not strings)
and it has string literals.

I want to define a string - is this correct:

<< A "string" is a list of integers where the integers
represent Unicode codepoints. >>

Questions:
Is the sentence inside << .. >> using the correct terminology?
If not what should it say?

Is the sentence inside << ... >> widely understood, do you think this
would confuse a lot of people?

Is the phrase "string literal" widely understood?


Cheers

/Joe
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Paul Barry

Jul 31, 2012, 5:41:48 AM
to Joe Armstrong, Erlang
Hi Joe.

I think "string literal" is pretty widely understood (it even has a
Wikipedia entry, here: http://en.wikipedia.org/wiki/String_literal).

What threw me about your sentence was the use of the word 'codepoint',
which will be OK for those already familiar with Unicode, but might
confuse those who are not. My feeling (and this might be a gross
over-simplification) is that most North-American programmers know
about Unicode but don't let it worry them too much, resulting in less
of a familiarity with it than might be necessary (and I apologize to
any North-American programmers that this comment rubs the wrong way).
Perhaps "Unicode characters" might be easier to read/understand?
Although probably not totally technically correct...

Another thing that you might wish to consider is breaking the sentence
in two, as follows:

<< An Erlang "string" is simply a list of integers. Each integer can
represent any Unicode codepoint/character. >>

Just my 2 cents.

Paul.
--
Paul Barry, w: http://paulbarry.itcarlow.ie - e: paul....@itcarlow.ie
Lecturer, Computer Networking: Institute of Technology, Carlow, Ireland.

Michel Rijnders

Jul 31, 2012, 5:51:50 AM
to Joe Armstrong, Erlang
On Tue, Jul 31, 2012 at 11:24 AM, Joe Armstrong <erl...@gmail.com> wrote:
> I'm working on a 2'nd edition of my book, and have got to strings :-)
> Strings confuse everybody, including me so I have a few questions:
>
> To start with Erlang doesn't have strings - it has lists (not strings)
> and it has string literals.
>
> I want to define a string - is this correct:
>
> << A "string" is a list of integers where the integers
> represent Unicode codepoints. >>
>
> Questions:
> Is the sentence inside << .. >> using the correct terminology?

Is the sentence referring to "strings" in Erlang or to strings in general?

For the first I prefer:
<< A "string" is represented by a list of integers, where the integers
are Unicode codepoints.>>

For the latter:
<< A "string" is a sequence of characters. >>

> If not what should it say?
>
> Is the sentence inside << ... >> widely understood, do you think this
> would confuse a lot of people?
>
> Is the phrase "string literal" widely understood?
>
>
> Cheers
>
> /Joe



--
My other car is a cdr.

Michael Turner

Jul 31, 2012, 5:53:08 AM
to Paul Barry, Erlang
> << An Erlang "string" is simply a list of integers. Each integer can
> represent any Unicode codepoint/character. >>

Except that Unicode codepoints represents characters, right? You can't
have a representation of a representation.[*]

I suggest:

<< In Erlang, strings are represented as lists of integers. These
integers are Unicode codepoints, each representing a character. >>

That way, anybody who's unclear on what "codepoint" means gets a
freebie definition of it. In the Unicode context, it's probably wrong,
technically, but perhaps good enough for this purpose.
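
As a small illustration of the wording above, the following sketch shows that a string literal and an explicit list of integers are the same term. The module name string_as_list and the function demo/0 are made up for this example.

```erlang
-module(string_as_list).
-export([demo/0]).

%% A string literal is only syntax for a list of integers.
%% Returns {true, true, true}.
demo() ->
    Literal = "hi",
    Explicit = [104, 105],              % the code points of $h and $i
    {Literal =:= Explicit,              % same term, written two ways
     is_list(Literal),                  % there is no separate string type
     [$h, $i] =:= Literal}.             % $h is just the integer 104
```

Calling string_as_list:demo() in the shell should return {true,true,true}.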

-michael turner

[*] Douglas Hofstadter might beg to differ, but he's not on this list.

Ian

Jul 31, 2012, 7:48:31 AM
to erlang-q...@erlang.org
On 31/07/2012 10:24, Joe Armstrong wrote:
> I'm working on a 2'nd edition of my book, and have got to strings :-)
> Strings confuse everybody, including me so I have a few questions:
>
> To start with Erlang doesn't have strings - it has lists (not strings)
> and it has string literals.
>
> I want to define a string - is this correct:
>
> << A "string" is a list of integers where the integers
> represent Unicode codepoints. >>

I think this is technically correct, but it is very confusing because it
implies that the source file may be encoded as unicode.

As I understand it, source files are always treated as being in Latin-1.
This means that string literals are lists of Latin-1 values, and not
lists of unicode codepoints. (The values from 128 to 255 have
different/no meanings, and values > 255 will not happen).

If you encode your source as something other than Latin-1, the result is
a mis-coding of your literal string, with all the problems that
presents (vanishing and reappearing characters, wrong lengths etc.).

The REPL does take notice of the locale and so can produce different
results from the same source strings!

I don't envy you the task of writing something that is clear, correct,
concise and comprehensible. That will be a challenge!

Regards

Ian

Masklinn

Jul 31, 2012, 8:44:34 AM
to Erlang
On 2012-07-31, at 11:53 , Michael Turner wrote:

>> << An Erlang "string" is simply a list of integers. Each integer can
>> represent any Unicode codepoint/character. >>
>
> Except that Unicode codepoints represents characters, right?

Depends, the definition of "character" is quite ambiguous.

By "character", many people mean what Unicode calls "grapheme" (a concrete
shape or shape-group displayed on a medium[-1]). The meaning of the word may
also change across cultures, for instance concerning diacritics some
cultures consider the base+diacritic(s) as a single character, others as
multiples. And it becomes very tough to define for e.g. hangul, is the
character a hangul block or the jamo composing it?[0]

The Unicode Standard itself lists 4 different and potentially
incompatible meanings for the word:

(1) The smallest component of written language that has semantic value;
refers to the abstract meaning and/or shape, rather than a specific
shape (see also glyph), though in code tables some form of visual
representation is essential for the reader’s understanding.
(2) Synonym for abstract character.
(3) The basic unit of encoding for the Unicode character encoding.
(4) The English name for the ideographic written elements of Chinese origin

where "abstract character" is defined as:

A unit of information used for the organization, control, or
representation of textual data.
* When representing data, the nature of that data is generally symbolic
as opposed to some other kind of data (for example, aural or visual).
Examples of such symbolic data include letters, ideographs, digits,
punctuation, technical symbols, and dingbats.
* An abstract character has no concrete form and should not be confused
with a glyph.
* An abstract character does not necessarily correspond to what a user
thinks of as a “character” and should not be confused with a grapheme.
* The abstract characters encoded by the Unicode Standard are known as
Unicode abstract characters.
* Abstract characters not directly encoded by the Unicode Standard can
often be represented by the use of combining character sequences.

In most meanings of the word "character", a character maps to a
(potentially unary) *sequence* of unicode code-points, there isn't a
1:1 mapping.

[-1] don't take my word for it, I might have fucked up my recollection,
I regularly get confused between the precise meanings of "glyph"
and "grapheme"

[0] Modern hangul is written as syllabic blocks, each block is composed
of three jamo (letters, technically 2 to 5 for ancient/historical
texts). For instance 한 (han) is a block composed of three jamo ㅎ,
ㅏ, and ㄴ, and unicode allows encoding it as either HANGUL SYLLABLE
HAN or a sequence of HANGUL CHOSEONG HIEUH, HANGUL JUNGSEONG A and
HANGUL JONGSEONG NIEUN. But is it a character or not?
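
The two encodings in footnote [0] can be compared with a short sketch. It assumes the normalisation functions unicode:characters_to_nfd_list/1 and unicode:characters_to_nfc_list/1, which were added to OTP (in release 20) well after this thread; the module name hangul_forms is hypothetical.

```erlang
-module(hangul_forms).
-export([demo/0]).

%% Compares the precomposed syllable with its jamo sequence.
%% Assumes OTP 20 or later for the normalisation functions.
demo() ->
    Syllable = [16#D55C],                    % HANGUL SYLLABLE HAN
    Jamo = [16#1112, 16#1161, 16#11AB],      % CHOSEONG HIEUH, JUNGSEONG A, JONGSEONG NIEUN
    {unicode:characters_to_nfd_list(Syllable) =:= Jamo,   % decomposes into the jamo
     unicode:characters_to_nfc_list(Jamo) =:= Syllable,   % recomposes into the syllable
     length(Syllable),                       % 1 code point
     length(Jamo)}.                          % 3 code points
```

Both lists display as the same block, yet length/1 reports 1 for one and 3 for the other, which is the ambiguity the footnote describes.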

Richard Carlsson

Jul 31, 2012, 10:04:05 AM
to erlang-q...@erlang.org
On 07/31/2012 01:48 PM, Ian wrote:
>> << A "string" is a list of integers where the integers
>> represent Unicode codepoints. >>
>
> I think this is technically correct, but it is very confusing because it
> implies that the source file may be encoded as unicode.
>
> As I understand it, source files are always treated as being in Latin-1.
> This means that string literals are lists of Latin-1 values, and not
> lists of unicode codepoints. (The values from 128 to 255 have
> different/no meanings, and values > 255 will not happen).

No, you're confusing Unicode (a sequence of code points) with specific
encodings such as UTF-8 and UTF-16. The first is downwards compatible
with Latin-1: the values from 128 to 255 are the same. In UTF-8 they're
not. At runtime, Erlang's strings are just plain sequences of Unicode
code points (you can think of it as UTF-32 if you like). Whether the
source code is encoded in UTF-8 or Latin-1 or any other encoding is
irrelevant as long as the compiler knows how to transform the input to
the single-codepoint representation.

For example, reversing a Unicode string is a bad idea anyway because it
could contain combining characters, and reversing the order of the
codepoints in that case will create an illegal string. But an expression
like lists:reverse("a∞b") will be working on the list [97, 8734, 98]
(once the compiler has been extended to accept other encodings than
Latin-1), not the list [97,226,136,158,98], so it will produce the
intended "b∞a". This string might then become encoded as UTF-8 on its
way to your terminal, but that's another story.
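
The difference between the two lists can be sketched as follows, working on explicit integers so that nothing depends on the source file encoding. The module name codepoints_vs_bytes is invented for this example; the unicode module calls are standard ones.

```erlang
-module(codepoints_vs_bytes).
-export([demo/0]).

%% Contrasts reversing code points with reversing UTF-8 bytes.
demo() ->
    CodePoints = [97, 8734, 98],                      % a, INFINITY (U+221E), b
    Utf8 = unicode:characters_to_binary(CodePoints),  % UTF-8 encoded binary
    Bytes = binary_to_list(Utf8),                     % [97,226,136,158,98]
    ReversedBytes = list_to_binary(lists:reverse(Bytes)),
    {Bytes,
     lists:reverse(CodePoints),                       % [98,8734,97]
     %% The reversed byte sequence is no longer valid UTF-8,
     %% so decoding it yields an error tuple.
     unicode:characters_to_list(ReversedBytes, utf8)}.
```

Reversing the three code points gives the intended result, while reversing the five bytes scrambles the multi-byte sequence for the infinity sign.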

/Richard

Michael Turner

Jul 31, 2012, 10:16:00 AM
to Masklinn, Erlang
On Tue, Jul 31, 2012 at 9:44 PM, Masklinn <mask...@masklinn.net> wrote:
> On 2012-07-31, at 11:53 , Michael Turner wrote:
>
>>> << An Erlang "string" is simply a list of integers. Each integer can
>>> represent any Unicode codepoint/character. >>
>>
>> Except that Unicode codepoints represents characters, right?
>
> Depends, the definition of "character" is quite ambiguous.

I think you're right, and if Joe goes with something like my wording
(which is not included in what you quote above), what you say below
should really be condensed into a footnote that refers to a more
complete and accurate discussion. E.g., "[*] In its fullest
generality, it's not quite that simple, since codepoints in some
writing systems can actually refer to *parts* of what most Westerners
might think of as a single 'character'; see XYZ for a more detailed
discussion."

[snip]

-michael turner

Michael Turner

Jul 31, 2012, 10:19:49 AM
to Richard Carlsson, erlang-q...@erlang.org
> At runtime, Erlang's strings are just plain sequences of Unicode code points
> (you can think of it as UTF-32 if you like).

Can you go further and say that it actually *is* UTF-32? A footnote
like "[*] Basically, UTF-32; see ref XYZ for details" might be
helpful.

-michael turner

Richard Carlsson

Jul 31, 2012, 10:37:10 AM
to Michael Turner, erlang-q...@erlang.org
On 07/31/2012 04:19 PM, Michael Turner wrote:
>> At runtime, Erlang's strings are just plain sequences of Unicode code points
>> (you can think of it as UTF-32 if you like).
>
> Can you go further and say that it actually *is* UTF-32? A footnote
> like "[*] Basically, UTF-32; see ref XYZ for details" might be
> helpful.

I'm loath to say that it *is* UTF-32, because with that term follows a
bunch of connotations such as word width and endianism, which don't
apply to the representation as Erlang integers. I'd like to just refer
to it as Unicode, but apparently that makes most people think it's
either UTF-8 or UTF-16.

Masklinn

Jul 31, 2012, 11:13:01 AM
to erlang-questions Questions
On 2012-07-31, at 16:37 , Richard Carlsson wrote:

> On 07/31/2012 04:19 PM, Michael Turner wrote:
>>> At runtime, Erlang's strings are just plain sequences of Unicode code points
>>> (you can think of it as UTF-32 if you like).
>>
>> Can you go further and say that it actually *is* UTF-32? A footnote
>> like "[*] Basically, UTF-32; see ref XYZ for details" might be
>> helpful.
>
> I'm loath to say that it *is* UTF-32, because with that term follows a bunch of connotations such as word width and endianism, which don't apply to the representation as Erlang integers. I'd like to just refer to it as Unicode, but apparently that makes most people think it's either UTF-8 or UTF-16.

Say it's a sequence of code points (reified as integers)? That's exactly
what it is. If people don't know what a code point is, they can look it
up. In any case, this shouldn't bring along any undue semantic baggage
and misconception.

Fred Hebert

Jul 31, 2012, 11:19:23 AM
to Richard Carlsson, erlang-q...@erlang.org
Even then the reversal is not guaranteed.

The character 'é' can be represented, for example, in two ways:

é = U+00E9
e+ ́ = U+0065 + U+0301

The first one allows a representation as a single codepoint, but the second one is a 'grapheme cluster', a sequence of codepoints representing a single grapheme, a single unit of text. Grapheme clusters can be larger than two elements, and as far as I know, you cannot reverse them. The cluster should ideally remain in the same order in the reversed string:

2> io:format("~ts~n",[[16#0065,16#0301]]).

ok
3> io:format("~ts~n",[[16#0301,16#0065]]).
 ́e
ok

This is fine with your plan -- if I force a single code point representation, this is a non-issue.

The tricky thing is that if I enter a string containing " ́e" in my module and later reverse it, I will get "é" and not "e ́" as a final result. What was initially [16#0301,16#0065] gets reversed into [16#0065,16#0301], which is not the same as the correct visual representation " ́e" (represented as ([16#0065, $ , 16#0301]), with an implicit space in there)

 It works one way (starting the right direction then reversing), but without being very careful, it can break when going the other way (starting with two non-combined code points that get assembled in the same cluster when reversed).

Just changing to single code point representations isn't enough to make sure nothing is broken.
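
A sketch of the reversal problem, using explicit code points. It assumes string:reverse/1 and unicode:characters_to_nfc_list/1 from the grapheme-aware API that was added to OTP (release 20) after this thread was written; the module name combining_demo is hypothetical.

```erlang
-module(combining_demo).
-export([demo/0]).

%% Shows naive versus grapheme-aware reversal of "e" + combining acute + "x".
%% Assumes OTP 20 or later.
demo() ->
    Precomposed = [16#00E9],                 % e-acute as a single code point
    Decomposed = [16#0065, 16#0301],         % e followed by COMBINING ACUTE ACCENT
    Text = Decomposed ++ "x",
    {unicode:characters_to_nfc_list(Decomposed) =:= Precomposed,
     %% Naive reversal moves the accent so that it now follows the x.
     lists:reverse(Text),
     %% string:reverse/1 keeps each grapheme cluster intact; flatten the
     %% result with characters_to_list/1 to compare plain code points.
     unicode:characters_to_list(string:reverse(Text))}.
```

With lists:reverse/1 the accent ends up attached to the wrong base character, whereas the cluster-aware reversal keeps e and its accent in their original order.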

Fred Hebert

Jul 31, 2012, 11:24:34 AM
to Richard Carlsson, erlang-q...@erlang.org

On 12-07-31 11:19 AM, Fred Hebert wrote:
> The tricky thing is that if I enter a string containing " ́e" in my module and later reverse it, I will get "é" and not "e ́" as a final result. What was initially [16#0301,16#0065] gets reversed into [16#0065,16#0301], which is not the same as the correct visual representation " ́e" (represented as ([16#0065, $ , 16#0301]), with an implicit space in there)

Quick note that the last " ́e" in there (possibly in red, depending on your mail client) should have been "e ́", which is represented as [16#0065, $ , 16#0301]. Sorry for the confusion.

Jan Burse

Jul 31, 2012, 11:40:46 AM
to erlang-q...@erlang.org
Masklinn schrieb:
> Say it's a sequence of code points (reified as integers)? That's exactly
> what it is. If people don't know what a code point is, they can look it
> up. In any case, this shouldn't bring along any undue semantic baggage
> and misconception.
>

If they are code points, there needs to be a reference to
the Unicode version (4.0 or 6.0 etc.), a clarification of whether
private-use codes are supported on specific platforms (e.g. the Apple
sign on the Mac), and a clarification of which planes are supported (basic
plane only, or supplementary planes also, or UCS etc.).

http://en.wikipedia.org/wiki/Universal_Character_Set

In the ISO Core Standard for Prolog (ISO/IEC 13211-1) the problem
is simply solved as follows:

The processor character set PCS is an implementation
defined character set. The members of PCS shall include
each character defined by char (6.5).

PCS may include additional members, known as extended
characters. It shall be implementation defined for each
extended character whether it is a graphic char, or an
alphanumeric char, or a solo char, or a layout char, or a
meta char.

char (* 6.5 *)
= graphic char (* 6.5.1 *)
| alphanumeric char (* 6.5.2 *)
| solo char (* 6.5.3 *)
| layout char (* 6.5.4 *)
| meta char (* 6.5.5 *) ;

This means the standard does not know about a Unicode extension. But
it requires that a Unicode extension can at least deal with
the same minimal subset unchanged; everything else is implementation
specific, i.e. Prolog system specific. Even for that subset, the
coding is not specified exactly; we only have:

NOTE - These requirements on the collating sequence are
satisfied by both ASCII and EBCDIC.

What the standard did not foresee was that there could be different
stream encodings on the same processor. So although we already have
in the standard:

NOTE - A character code may correspond to more than
one byte in a stream. Thus, inputting a single character
may consume several bytes from an input stream, and writing
a single character may output several bytes to an output stream.

The current practice is that many Prolog systems offer an encoding/1
option in the stream handling, although no corrigendum has yet
picked that up. See for example SWI Prolog:


http://www.swi-prolog.org/pldoc/doc_for?object=section%282,%272.18%27,swi%28%27/doc/Manual/widechars.html%27%29%29

Bye

Richard Carlsson

Jul 31, 2012, 11:50:11 AM
to Fred Hebert, erlang-q...@erlang.org
If you take another look at what I wrote, this is precisely what I was
talking about. But you are confusing grapheme clusters with combining
characters; they are not the same thing. A grapheme cluster is the next
higher conceptual level, and a cluster could consist of multiple
characters, each of which could be individually made up of a base
character (such as "e") plus one or more combining characters (like
U+0301 COMBINING ACUTE ACCENT).

/Richard

CGS

Jul 31, 2012, 12:00:56 PM
to Fred Hebert, erlang-q...@erlang.org
Actually, that depends on the environment you are working in, I suppose. Testing in my local environment, I got this:

1> io:format("~ts~n",[[16#00E9]]).        
é
ok
2> io:format("~ts~n",[[16#0065,16#0301]]).
e ́
ok
3> io:format("~ts~n",[[16#0301,16#0065]]). 
 ́e
ok

CGS





Fred Hebert

Jul 31, 2012, 12:03:01 PM
to Richard Carlsson, erlang-q...@erlang.org
Your post seemed to imply that converting to single code point
representation is good enough. I do not understand how that distinction
solves the problem of string reversal as I wrote it here, though.

I would expect, as a user of some string data type or bytestring that
claims to support unicode, that reversing a string with the characters "
́e" would give me "e ́". Single code point representation or not.

The concept of cluster has to be understood for it to make sense.

Regarding your latest post (I received it while writing this one).
Cursed be the problem of multiple environments. This is never going to
be easy to figure out!

Richard Carlsson

Jul 31, 2012, 12:49:07 PM
to Fred Hebert, erlang-q...@erlang.org
On 07/31/2012 06:03 PM, Fred Hebert wrote:
> Your post seemed to imply that converting to single code point
> representation is good enough. I do not understand how that distinction
> solves the problem of string reversal as I wrote it here, though.

It doesn't. That's what I said: "reversing a Unicode string is a bad
idea anyway because it could contain combining characters". But I also
clarified that on the Erlang level, at runtime, strings will contain
single code points rather than a UTF-8 encoded byte sequence, so for the
particular example of "a∞b" it happens to work. Nothing more, nothing less.

> I would expect, as a user of some string data type or bytestring that
> claims to support unicode, that reversing a string with the characters "
> ́e" would give me "e ́". Single code point representation or not.

Yes. That's why there needs to be a new Unicode-aware string library.
Operating directly on lists (e.g. using lists:reverse/1, or even
length/1) is always going to have surprising effects, and the old
'string' module in stdlib probably can't be modernized while maintaining
backwards compatibility.

> The concept of cluster has to be understood for it to make sense.

Grapheme clusters are actually one of the things you don't need to think
too much about unless you're writing an editor or similar where you need
to figure out between which code points to move the cursor or select a
sequence of code points based on what the user points to on the screen.
Combining characters are a much more basic thing and need to be
understood by pretty much anyone working with Unicode.

CGS

Jul 31, 2012, 6:52:15 PM
to Richard Carlsson, erlang-q...@erlang.org
On Tue, Jul 31, 2012 at 4:04 PM, Richard Carlsson <carlsson...@gmail.com> wrote:
> On 07/31/2012 01:48 PM, Ian wrote:
>> << A "string" is a list of integers where the integers
>> represent Unicode codepoints. >>
>>
>> I think this is technically correct, but it is very confusing because it
>> implies that the source file may be encoded as unicode.
>>
>> As I understand it, source files are always treated as being in Latin-1.
>> This means that string literals are lists of Latin-1 values, and not
>> lists of unicode codepoints. (The values from 128 to 255 have
>> different/no meanings, and values > 255 will not happen).
>
> No, you're confusing Unicode (a sequence of code points) with specific encodings such as UTF-8 and UTF-16. The first is downwards compatible with Latin-1: the values from 128 to 255 are the same. In UTF-8 they're not. At runtime, Erlang's strings are just plain sequences of Unicode code points (you can think of it as UTF-32 if you like). Whether the source code is encoded in UTF-8 or Latin-1 or any other encoding is irrelevant as long as the compiler knows how to transform the input to the single-codepoint representation.
>
> For example, reversing a Unicode string is a bad idea anyway because it could contain combining characters, and reversing the order of the codepoints in that case will create an illegal string. But an expression like lists:reverse("a∞b") will be working on the list [97, 8734, 98] (once the compiler has been extended to accept other encodings than Latin-1...

Actually, try this:

1. set your environment to UTF-8 (in my case, whatever Linux terminal with BASH environment, export LANG="en_US.utf8", use locale to find your environment language definition - "en_US.latin1" for LATIN-1)
2. in a module:

test_reverse(String) -> lists:reverse(String).

3. Give as parameter the example given by yourself.
4. Check the output.

Pretty interesting to see how Erlang "knows" about UTF-8 encoding, isn't it? (You can try lists:reverse("a∞b") directly in the shell and it will be transformed as expected, using a 3-element list.) Actually, it knows nothing about it; it relies on the environment to extract the integers for the list, which here mimics knowledge of UTF-8.

> ...), not the list [97,226,136,158,98], so it will produce the intended "b∞a". This string might then become encoded as UTF-8 on its way to your terminal, but that's another story.

I would add "from" to the last part ("on its way to your terminal"), rather than leaving only "to": it seems that both directions are valid, even if that can break the code.

I agree that for string literals, what you said is always true.
 
CGS

Michael Turner

Jul 31, 2012, 8:51:21 PM
to Masklinn, erlang-questions Questions
On Wed, Aug 1, 2012 at 12:13 AM, Masklinn <mask...@masklinn.net> wrote:
> On 2012-07-31, at 16:37 , Richard Carlsson wrote:
>
>> On 07/31/2012 04:19 PM, Michael Turner wrote:
>>>> At runtime, Erlang's strings are just plain sequences of Unicode code points
>>>> (you can think of it as UTF-32 if you like).
>>>
>>> Can you go further and say that it actually *is* UTF-32? A footnote
>>> like "[*] Basically, UTF-32; see ref XYZ for details" might be
>>> helpful.
>>
>> I'm loath to say that it *is* UTF-32, because with that term follows a bunch of connotations such as word width and endianism, which don't apply to the representation as Erlang integers. I'd like to just refer to it as Unicode, but apparently that makes most people think it's either UTF-8 or UTF-16.
>
> Say it's a sequence of code points (reified as integers)? That's exactly
> what it is.

But as Richard Carlsson points out, that's NOT quite "exactly what it is."

> ... If people don't know what a code point is, they can look it
> up. In any case, this shouldn't bring along any undue semantic baggage
> and misconception.

The perfect is often the enemy of the good. Perfect precision is
sometimes the enemy of good initial comprehension. In my experience,
that's *definitely* true of most approaches to Unicode.

I hope we haven't lost track of Joe's goal here: a reasonably accurate
description of what Erlang strings are, in a passage that, at the
very least, should not intimidate newbies. A quick and approximate
gloss of "codepoint" in the main text, together with a footnote
apologizing for the oversimplification and suggesting a more detailed
reference, strikes me as the best compromise between precision and the
need to appeal to the reader. Appealing to the reader is not exactly
optional for Erlang. Every day, I hear more about Scala and Node.js.

-michael turner

Richard O'Keefe

Aug 1, 2012, 12:33:08 AM
to Michael Turner, Erlang

On 31/07/2012, at 9:53 PM, Michael Turner wrote:

>> << An Erlang "string" is simply a list of integers. Each integer can
>> represent any Unicode codepoint/character. >>
>
> Except that Unicode codepoints represents characters, right?

Wrong.

One Unicode codepoint may represent what a particular language
views as two distinct graphemes. (This occurs in encoding English,
for example: in 'belovéd' the diacritical mark is a stress accent
and so é counts as two separate graphemes.)

One grapheme may require two or more Unicode codepoints.
Some characters, well, 26FD FE0E is a black-and-white picture
of a petrol pump, but 26FD FE0F is a colour version. Either
of these is perceived by users as a single 'character'.
FE0E represents "select text style for the previous thingy";
FE0F represents "select emoji style for it". You'd be hard
pressed to call either FE0E or FE0F a "character".

The majority of codepoints represent nothing at all (yet).

The thing people *still* don't get about Unicode is that
with ASCII and EBCDIC and Latin-1 there really was such a
thing as a "character" that a string was a sequence of, but
in Unicode, a string is *not* a sequence of characters but
a *well-formed* sequence of codepoints. You *can't* represent
the "emoji style FUEL PUMP" <<character>> by a single number,
only by a *sequence* of codepoints.

I keep meaning to write a small book called "Strings Made Difficult."

From the Unicode FAQ:

Q: So is a combining character sequence the same as a “character”?

A: That depends. For a programmer, a Unicode code value represents a single character (for exceptions, see below). For an end user, it may not. The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

...
Q: How should characters (particularly composite characters) be counted, for the purposes of length, substrings, positions in a string, etc.?

A: In general, there are 3 different ways to count characters. Each is illustrated with the following sample string.
“a” + umlaut + greek_alpha + \uE0000.
(the latter is a private use character)

1. Code Units: e.g. how many bytes are in the physical representation of the string. Example:
In UTF-8, the sample has 9 bytes. [61 CC 88 CE B1 F3 A0 80 80]
In UTF-16BE, it has 10 bytes. [00 61 03 08 03 B1 DB 40 DC 00]
In UTF-32BE, it has 16 bytes. [00 00 00 61 00 00 03 08 00 00 03 B1 00 0E 00 00]

2. Codepoints: how many code points are in the string.
The sample has 4 code points. This is equivalent to the UTF-32BE count divided by 4.

3. Graphemes: what end-users consider as characters.
A default grapheme cluster is specified in UAX #29, Unicode Text Segmentation, as well as in UTS #18, Unicode Regular Expressions.

The choice of which one to use depends on the tradeoffs between efficiency and comprehension. For example, Java, Windows and ICU use #1 with UTF-16 for all low-level string operations, and then also supply layers above that provide for #2 and #3 boundaries when circumstances require them. This approach allows for efficient processing, with allowance for higher-level usage. However, for a very high level application, such as word-processing macros, graphemes alone will probably be sufficient.
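
The code-unit and code-point counts quoted from the FAQ can be reproduced with a short sketch built on unicode:characters_to_binary/3. The module name count_units and its functions are invented for this example; the grapheme count is left out because it depends on segmentation rules the FAQ only refers to.

```erlang
-module(count_units).
-export([sample/0, counts/1]).

%% The FAQ's sample string: "a", combining diaeresis, Greek alpha, U+E0000.
sample() ->
    [16#0061, 16#0308, 16#03B1, 16#E0000].

%% Returns {Utf8Bytes, Utf16Bytes, Utf32Bytes, CodePoints} for a list of
%% code points.
counts(S) ->
    Size = fun(Encoding) ->
               byte_size(unicode:characters_to_binary(S, unicode, Encoding))
           end,
    {Size(utf8),
     Size({utf16, big}),
     Size({utf32, big}),
     length(S)}.
```

count_units:counts(count_units:sample()) should return {9,10,16,4}, matching the byte and code point counts listed in the FAQ answer.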


Richard O'Keefe

Aug 1, 2012, 12:45:15 AM
to Richard Carlsson, erlang-q...@erlang.org

On 1/08/2012, at 4:49 AM, Richard Carlsson wrote:

> Yes. That's why there needs to be a new Unicode-aware string library. Operating directly on lists (e.g. using lists:reverse/1, or even length/1) is always going to have surprising effects, and the old 'string' module in stdlib probably can't be modernized while maintaining backwards compatibility.

It should be noted that the "length" of a Unicode string is an inherently
ambiguous concept anyway. It makes sense to ask how many codepoints
there are in the string as given, or how many there would be in a particular
normalisation form, but neither of those is how many characters the *user*
would count (which is not just script-sensitive, not just locale-sensitive,
but very context-sensitive). Oh, and none of them is the same as "how many
columns would this require in a fixed-width font."
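
The different answers to "how long is this string?" can be seen in a small sketch. It assumes string:length/1 counts grapheme clusters and that unicode:characters_to_nfc_list/1 is available, both of which hold only from OTP 20 onwards, after this thread; the module name length_demo is hypothetical.

```erlang
-module(length_demo).
-export([demo/0]).

%% Three different "lengths" of a decomposed e-acute.
%% Assumes OTP 20 or later.
demo() ->
    S = [16#0065, 16#0301],                        % e + COMBINING ACUTE ACCENT
    {length(S),                                    % code points as given: 2
     length(unicode:characters_to_nfc_list(S)),    % code points after NFC: 1
     string:length(S)}.                            % grapheme clusters: 1
```

None of these numbers says how many columns the text would occupy in a fixed-width font, which is a separate question again.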

Michael Turner

Aug 1, 2012, 3:32:57 AM
to Richard O'Keefe, Erlang
On Wed, Aug 1, 2012 at 1:33 PM, Richard O'Keefe <o...@cs.otago.ac.nz> wrote:
>
> On 31/07/2012, at 9:53 PM, Michael Turner wrote:
>
>>> << An Erlang "string" is simply a list of integers. Each integer can
>>> represent any Unicode codepoint/character. >>
>>
>> Except that Unicode codepoints represents characters, right?
>
> Wrong.

Actually, what's *really* wrong in my statement is the grammar -- bad
plural agreement.

> One Unicode codepoint may represent what a particular language
> views as two distinct graphemes. (This occurs in encoding English,
> for example: in 'belovéd' the diacritical mark is a stress accent
> and so é counts as two separate graphemes.)

[snip much more]

I'm certain this is correct, Richard, but ... what problem are we
trying to solve again? IIRC: Joe is trying to come up with a short
passage that explains what strings are, in Erlang. If he writes all
that you wrote above, the reader (who might have been initially
excited about Erlang) will come away with the impression, "Erlang
people are excruciatingly pedantic".

> I keep meaning to write a small book called "Strings Made Difficult."

Sounds like you're the man to do it, Richard.

As I wrote earlier:

>> << In Erlang, strings are represented as lists of integers. These
>> integers are Unicode codepoints, each representing a character. >>
>>
>> That way, anybody who's unclear on what "codepoint" means gets a
>> freebie definition of it. In the Unicode context, it's probably wrong,
>> technically, but perhaps good enough for this purpose.

Can anyone tell me why this *wouldn't* serve Joe's (== the typical
reader's) purposes? [*]

-michael turner

[*] Why do I suspect we're now going to have a long digression on
whether the "==" in "(== the typical reader's)" should really be
"=:="?

Richard Carlsson

Aug 1, 2012, 4:39:07 AM
to CGS, erlang-q...@erlang.org
On 08/01/2012 12:52 AM, CGS wrote:
> Actually, try this:
>
> 1. set your environment to UTF-8 (in my case, whatever Linux terminal
> with BASH environment, export LANG="en_US.utf8", use locale to find your
> environment language definition - "en_US.latin1" for LATIN-1)
> 2. in a module:
>
> test_reverse(String) -> lists:reverse(String).
>
> 3. Give as parameter the example given by yourself.
> 4. Check the output.

Ah, but when you say "give as parameter" you mean "pass it a string
literal from the shell", right? I never said anything about strings in
the shell - that's a different environment from source files, and as you
described, the shell nowadays detects your locale and translates UTF-8
console input into a string literal containing Unicode code points. This
is exactly how it would happen in source code as well, if the compiler
only knew how to detect that a source file is in a different encoding
from Latin1. So the compiler is really the main thing that needs to be
fixed, and then there should be no surprises on the encoding level anymore.
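
A minimal sketch of why the encoding level matters even for lists:reverse/1, using the code points and UTF-8 bytes of "a∞b" written as plain integers so that the encoding of the source file itself plays no part. The module name reverse_demo is made up for illustration; the lists and unicode calls are standard OTP functions.

```erlang
-module(reverse_demo).
-export([demo/0]).

demo() ->
    %% "a", INFINITY (U+221E), "b" as a list of Unicode code points.
    CodePoints = [97, 8734, 98],
    %% Reversing code points gives a sensible string: "b", INFINITY, "a".
    [98, 8734, 97] = lists:reverse(CodePoints),

    %% The same text as a list of UTF-8 bytes, which is what you get
    %% when UTF-8 source is read as if it were Latin-1.
    Bytes = binary_to_list(unicode:characters_to_binary(CodePoints)),
    [97, 226, 136, 158, 98] = Bytes,

    %% Reversing the bytes scrambles the three-byte sequence, so the
    %% result no longer decodes as UTF-8 (an error tuple, not a list).
    Reversed = lists:reverse(Bytes),
    false = is_list(unicode:characters_to_list(list_to_binary(Reversed), utf8)),
    ok.
```

Calling reverse_demo:demo() returns ok when every match above holds.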

Ian

Aug 1, 2012, 5:52:28 AM
to erlang-q...@erlang.org
On 01/08/2012 09:39, Richard Carlsson wrote:

> [the shell is] a different environment from source files, and as you
> described, the shell nowadays detects your locale and translates UTF-8
> console input into a string literal containing Unicode code points.
> This is exactly how it would happen in source code as well, if the
> compiler only knew how to detect that a source file is in a different
> encoding from Latin1. So the compiler is really the main thing that
> needs to be fixed, and then there should be no surprises on the
> encoding level anymore.
Anyone any idea about timescales for making the compiler aware of the
environment?

Ian

CGS

Aug 1, 2012, 8:25:50 AM
to Richard Carlsson, erlang-q...@erlang.org
On Wed, Aug 1, 2012 at 10:39 AM, Richard Carlsson <carlsson...@gmail.com> wrote:
> On 08/01/2012 12:52 AM, CGS wrote:
>> Actually, try this:
>>
>> 1. set your environment to UTF-8 (in my case, whatever Linux terminal
>> with BASH environment, export LANG="en_US.utf8", use locale to find your
>> environment language definition - "en_US.latin1" for LATIN-1)
>> 2. in a module:
>>
>> test_reverse(String) -> lists:reverse(String).
>>
>> 3. Give as parameter the example given by yourself.
>> 4. Check the output.
>
> Ah, but when you say "give as parameter" you mean "pass it a string literal from the shell", right?

Yes. Sorry for the confusion.

> I never said anything about strings in the shell - that's a different environment from source files, and as you described, the shell nowadays detects your locale and translates UTF-8 console input into a string literal containing Unicode code points.

The shell was just an example which is a part of the problem. Another part is the communication over HTTP/HTTPS protocol. And there may be some other examples. My point here is the interaction of Erlang with any environment which uses UTF-8. If Erlang compiler doesn't become aware of UTF-8, the code may be unstable without even understanding why. As you said, it seems that the compiler needs a fix.

> This is exactly how it would happen in source code as well, if the compiler only knew how to detect that a source file is in a different encoding from Latin1. So the compiler is really the main thing that needs to be fixed, and then there should be no surprises on the encoding level anymore.

I couldn't agree more.

CGS

Thomas Lindgren

Aug 1, 2012, 2:42:54 PM
to erlang-q...@erlang.org




----- Original Message -----
> From: Richard Carlsson <carlsson...@gmail.com>
...
> Ah, but when you say "give as parameter" you mean "pass it a
> string literal from the shell", right? I never said anything about strings
> in the shell - that's a different environment from source files, and as you
> described, the shell nowadays detects your locale and translates UTF-8 console
> input into a string literal containing Unicode code points. This is exactly how
> it would happen in source code as well, if the compiler only knew how to detect
> that a source file is in a different encoding from Latin1. So the compiler is
> really the main thing that needs to be fixed, and then there should be no
> surprises on the encoding level anymore.


How about adding compiler warnings about string literals that do not obey
the designated encoding? (There should then, of course, be multiple possibilities to choose from.)

E.g., "warn if strings not UTF8", "warn if not Latin-1", "warn if not compatible with current locale" ...

If so, it might also be nice to check strings in object files for the same properties, 
and perhaps make the shell interpreter complain too.

Or maybe you start your erl or erlc with, say, "--string-encoding=..."

Best,
Thomas

Vlad Dumitrescu

Aug 1, 2012, 2:57:23 PM
to Thomas Lindgren, erlang-q...@erlang.org
Hi,

On Wed, Aug 1, 2012 at 8:42 PM, Thomas Lindgren
<thomasl...@yahoo.com> wrote:
> How about adding compiler warnings about string literals that do not obey
> the designated encoding? (There should then, of course, be multiple possibilities to choose from.)
>
> E.g., "warn if strings not UTF8", "warn if not Latin-1", "warn if not compatible with current locale" ...

The problem is that there aren't any invalid Latin-1 characters and
thus strings. One can guess that for example "Ã¸" is not a meaningful
string, and that it's actually "ø", but it's not 100% reliable...

regards,
Vlad

Thomas Lindgren

Aug 1, 2012, 4:03:53 PM
to Vlad Dumitrescu, erlang-q...@erlang.org




----- Original Message -----
> From: Vlad Dumitrescu <vlad...@gmail.com>
...
>
> On Wed, Aug 1, 2012 at 8:42 PM, Thomas Lindgren
> <thomasl...@yahoo.com> wrote:
>> How about adding compiler warnings about string literals that do not obey
>> the designated encoding? (There should then, of course, be multiple
>> possibilities to choose from.)
>>
>> E.g., "warn if strings not UTF8", "warn if not
>> Latin-1", "warn if not compatible with current locale" ...
>
> The problem is that there aren't any invalid Latin-1 characters and
> thus strings. One can guess that for example "Ã¸" is not a meaningful
> string, and that it's actually "ø", but it's not 100%
> reliable...


Good catch, maybe one could check for subsets of Latin-1 or something
if so desired. I guess I'm mainly interested in UTF8 myself.

Best,
Thomas

Richard O'Keefe

Aug 1, 2012, 9:28:35 PM
to Thomas Lindgren, erlang-q...@erlang.org

On 2/08/2012, at 6:42 AM, Thomas Lindgren wrote:
>
> How about adding compiler warnings about string literals that do not obey
> the designated encoding? (There should then, of course, be multiple possibilities to choose from.)
>

What does this actually mean?

There is no byte sequence valid in UTF-8 that is not also
valid in Latin-1. Yes, codes 128..159 are control characters,
but nobody ever said that control characters weren't legal in
strings. Checking the mappings that came with Unicode 4,
there is no byte sequence valid in UTF-8 that is not also
valid in ISO 8859-{1,2,4,5,9,10,13,14,15}, PC code pages
437, 737, 775, 850, 852, 885, 86[012356], and Apple Arabic,
Central European, Croatian, Cyrillic, Farsi, Greek, Hebrew,
Icelandic, Roman, Romanian, Squeak, and Turkish.

So I have no idea what "string literals that do not obey
the designated encoding" means or how to operationalise it.

Richard O'Keefe

Aug 1, 2012, 9:58:42 PM
to Michael Turner, Erlang

On 1/08/2012, at 7:32 PM, Michael Turner wrote:

> On Wed, Aug 1, 2012 at 1:33 PM, Richard O'Keefe <o...@cs.otago.ac.nz> wrote:
>>
>> On 31/07/2012, at 9:53 PM, Michael Turner wrote:
>>
>>>> << An Erlang "string" is simply a list of integers. Each integer can
>>>> represent any Unicode codepoint/character. >>
>>>
>>> Except that Unicode codepoints represents characters, right?
>>
>> Wrong.
>
> Actually, what's *really* wrong in my statement is the grammar -- bad
> plural agreement.

No, what's wrong is the assertion that Unicode codepoints
represent characters.

> I'm certain this is correct, Richard, but ... what problem are we
> trying to solve again?

Having Joe not accidentally lying to his readers.


> IIRC: Joe is trying to come up with a short
> passage that explains what strings are, in Erlang. If he writes all
> that you wrote above, the reader (who might have been initially
> excited about Erlang) will come away with the impression, "Erlang
> people are excruciatingly pedantic".

Nobody ever said that >Joe< should say all that.

What he needs to say is something like
+++ Unicode is actually insanely complicated,
+++ but a good starting point is that each character
+++ named in the Unicode standard is assigned a
+++ unique integer called a codepoint,
+++ and an Erlang string is just a list of these numbers.

The phrase 'named in the standard' makes this literally true; it also
asserts that characters are given codepoints (true), not that
codepoints represent characters (false).
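
A small sketch of what "an Erlang string is just a list of these numbers" means in practice. The module name string_is_list is invented for the example; everything else is ordinary Erlang syntax plus io_lib:printable_unicode_list/1 from stdlib. Only ASCII appears in the source, so the file encoding does not affect the result.

```erlang
-module(string_is_list).
-export([demo/0]).

demo() ->
    %% A string literal is only syntax for a list of integers.
    true = ("abc" =:= [97, 98, 99]),
    %% $a is character-literal syntax for the integer 97.
    true = ($a =:= 97),
    %% 8734 is the codepoint assigned to INFINITY (U+221E); a list
    %% containing it is still a "string" in the sense used above.
    S = [97, 8734, 98],
    3 = length(S),
    true = io_lib:printable_unicode_list(S),
    ok.
```

Note that length/1 counts codepoints here, which is not the same thing as counting what a reader would call characters.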

>>> << In Erlang, strings are represented as lists of integers. These
>>> integers are Unicode codepoints, each representing a character. >>
>>>
>>> That way, anybody who's unclear on what "codepoint" means gets a
>>> freebie definition of it. In the Unicode context, it's probably wrong,
>>> technically, but perhaps good enough for this purpose.
>
> Can anyone tell me why this *wouldn't* serve Joe's (== the typical
> reader's) purposes? [*]

Because it is dangerously wrong and misleading.
Why say what is untrue, when you can say something true that is
nearly as simple and serves the job just as well?

Eric Moritz

Aug 1, 2012, 11:18:47 PM
to Richard O'Keefe, Erlang Questions


> There is no byte sequence valid in UTF-8 that is not also
> valid in Latin-1.

This is incorrect.

Latin-1 code points are a subset of Unicode codepoints. 

Codepoints are not bytes. Codepoints are indexes in character tables. Latin-1 is a table of a possible 256 characters, whereas Unicode is at this point a table of more than 100,000 characters. There are actually codepoints in the range of 127-159 which are unused and if used are technically invalid Latin-1 and Unicode.

When it comes to the binary representation of these codepoints.  Latin-1 is encoded as literal bytes because all codepoints are less than 256.  Unicode codepoints on the other hand can be larger than 255 so in order to represent them as bytes they need to be encoded.

Latin-1 bytes larger than 127 are not the same character in UTF-8 because UTF-8 uses the 8th bit for encoding multi-byte sequences to represent Unicode codepoints which are larger than 127. So while values in a list greater than 127 are valid Latin-1, if those values represent UTF-8 bytes, the characters are not the same.

For instance, 233 is the codepoint for an accented e in Latin-1 and Unicode, the binary representation of that character in Latin-1 is literally the byte <<233>> but when the codepoint is encoded as UTF-8, it is the bytes <<195,169>>.

The list [195,169] is never going to be an accented e in Erlang because as far as Erlang is concerned, that is a list of Latin-1 codepoints which are the characters Ã and ©. Ever see CafÃ© on a webpage? That is because they told the browser that their HTML was latin-1 when it was actually UTF-8.

It just so happens that [195,169] is also of type chardata() because all valid latin-1 codepoints are also valid Unicode codepoints. In either case, [195,169] is not an accented e. At the very least it is a list of integers whose values represent UTF-8 encoded bytes but until you convert those UTF-8 bytes to Unicode codepoints it'll never be chardata() with the correct characters.
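
Here is a short sketch of the 233 versus [195,169] example, written with integers only. The module name latin1_vs_utf8 is hypothetical; unicode:characters_to_binary/3 and unicode:characters_to_list/2 are the standard stdlib functions.

```erlang
-module(latin1_vs_utf8).
-export([demo/0]).

demo() ->
    %% Codepoint 233 (accented e) in the two encodings.
    <<233>>      = unicode:characters_to_binary([233], unicode, latin1),
    <<195, 169>> = unicode:characters_to_binary([233], unicode, utf8),

    %% Reading the UTF-8 bytes as if they were Latin-1 yields two
    %% codepoints, 195 and 169, rather than the single codepoint 233.
    [195, 169] = unicode:characters_to_list(<<195, 169>>, latin1),

    %% Decoding them as UTF-8 recovers the intended codepoint.
    [233] = unicode:characters_to_list(<<195, 169>>, utf8),
    ok.
```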

To summarize: Unicode is a table of codepoints.  A codepoint is an index in the table. UTF-8 is a codec for turning codepoints to and from bytes. UTF-8 cannot be used to refer to what Erlang calls chardata(). chardata() is a list of integer() whose value is a valid Unicode codepoint. UTF-8 can only refer to a sequence of bytes.

Eric.

Eric Moritz

Aug 1, 2012, 11:32:51 PM
to Richard Carlsson, Erlang Questions

So currently if you encode your source code as UTF-8, string literals become the literal byte sequences. This is different than in the shell where string literals get automatically turned into their Unicode codepoints. 

It appears that the solution is a compiler flag that tells the compiler that string literals should be decoded as UTF-8. So when the compiler reads the byte sequence 16#C3A9 it knows that it should be a 233 in the list because 16#C3A9 is the UTF-8 encoded sequence for the codepoint 233.

I don't know what the overall support for chardata() and charlist() is in the standard lib, so doing this may cause many headaches when someone tries to stuff a charlist() where an iolist() goes or chardata() where a string() goes. This may introduce subtle bugs that only occur when non-latin-1 characters are used.

Eric.

Richard O'Keefe

Aug 2, 2012, 12:39:58 AM
to Eric Moritz, Erlang Questions

On 2/08/2012, at 3:18 PM, Eric Moritz wrote:

>
> > There is no byte sequence valid in UTF-8 that is not also
> > valid in Latin-1.
>
> This is incorrect.

Let's be pedantic here.
There is no sequence of bytes B such that
(1) B conforms to the rules of UTF-8 and
(2) B cannot also be decoded as Latin 1

This is 100% correct.
>
> Latin-1 code points are a subset of Unicode codepoints.

True and totally irrelevant. The statement in question has
nothing to say about codepoints.

> Codepoints are not bytes.

Also true and totally irrelevant. The statement in question
has nothing to say about codepoints.

> Codepoints are indexes in character tables. Latin-1 is a table of a possible 256 characters, whereas Unicode is at this point a table of more than 100,000 characters. There are actually codepoints in the range of 127-159 which are unused and if used are technically invalid Latin-1 and Unicode.

I suppose it depends on what you mean by "Latin 1".
If you look at the code tables in
http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf
you're right: 127 is not there.

But then neither are TAB, CR, or LF there.

If you want to talk about "Latin 1" in any sense that includes
those control characters, you have to admit the others.
The framework is specified by ECMA 43, which requires ESC and
DEL. So byte 127 is not invalid.
If you want TAB, CR, LF, and so on, then you get them from
ECMA 48, the C0 set. Bytes with values 128 to 159 *also* come from
ECMA 48.

So when I talk about "Latin 1" I mean all the printing characters
*and* all the ECMA C0 and C1 control characters.

It's not just me. Look for example at
http://www.madore.org/~david/computers/unicode/cstab.html#Latin-1
which shows the control character names in red.
More importantly, look at the mapping tables produced by the
Unicode consortium, specifically 8859-1.TXT.
0x7F 0x007F # DELETE
0x80 0x0080 # <control>
...
0x9F 0x009F # <control>
0xA0 0x00A0 # NO-BREAK SPACE

The Unicode consortium think that 0x7F to 0x9F are Latin-1
control characters -- they use #UNDEFINED to mark characters
that are not defined at all in the source character set --
and for what it's worth, U+007F to U+009F are listed in the
Unicode character data base as *defined* characters with
class Cc, and they formerly even named the functions they
perform.
>
> When it comes to the binary representation of these codepoints.

I specifically wrote about BYTE SEQUENCES. Nothing else is
relevant. I did not write about codepoints.

> Latin-1 is encoded as literal bytes because all codepoints are less than 256.

You can encode Latin 1 in all sorts of ways.
Bytes work because it's a member of the
ECMA "8-Bit Coded Character Set" family.

> Unicode codepoints on the other hand can be larger than 255 so in order to represent them as bytes they need to be encoded.

That's not relevant. It doesn't matter *what* UTF-8 encodes here,
the only point is that since a UTF-8 sequence is a byte sequence,
and since every byte sequence is a valid Latin 1 encoding, there
is no byte sequence that is a valid UTF-8 sequence but not a valid
Latin 1 sequence.
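
To make the asymmetry concrete, here is a rough sketch. The module encoding_check and both function names are invented for the illustration; the only library call is unicode:characters_to_list/2, which returns a list on success and an error or incomplete tuple otherwise. A UTF-8 check can fail, while a Latin 1 check (taking the C0 and C1 controls as legal) has nothing to reject.

```erlang
-module(encoding_check).
-export([valid_utf8/1, valid_latin1/1, demo/0]).

%% True when the whole binary decodes as UTF-8.
valid_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        L when is_list(L) -> true;
        _ -> false          % {error, _, _} or {incomplete, _, _}
    end.

%% Every byte value 0..255 maps to some Latin 1 character, so there
%% is no byte sequence this could ever reject.
valid_latin1(Bin) when is_binary(Bin) -> true.

demo() ->
    true  = valid_utf8(<<195, 169>>),
    false = valid_utf8(<<233>>),      % a lead byte with nothing after it
    true  = valid_latin1(<<195, 169>>),
    true  = valid_latin1(<<233>>),
    ok.
```

So a failed UTF-8 check tells you something, but a successful one cannot rule out Latin 1.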

There are of course many ways to encode Unicode as sequences of bytes.
We could, to be ridiculous, represent each Unicode codepoint as a
sequence of 21 bytes each with value 0x30 or 0x31. More realistically,
SCSU and BOCU have advantages. The thing is, there is no byte
sequence that cannot be interpreted as representing a sequence of
Latin 1 characters (including control characters), so there is no way
of being certain what you have.

Of course an XML document must start with zero or more white space
characters followed by a left angle bracket. A higher level protocol
like that _may_ impose constraints that let you figure out what you
have. Similarly an Erlang module must start with zero or more
white space characters or % comments followed by a hyphen-minus character.
That is enough to allow XML-style discrimination between big- and
little-endian 4-byte and 2-byte representations, some flavour of
EBCDIC, and some extension of ASCII, but not to discriminate between
Latin 1 and UTF-8.

I've deleted the rest of the message as also beside the point.

Eric Moritz

Aug 2, 2012, 12:54:50 AM
to Richard O'Keefe, Erlang Questions

Sorry. I took your statement out of context.

Thomas Järvstrand

Aug 2, 2012, 4:54:12 AM
to Joe Armstrong, Erlang
Hi,

This is an introductory book right, so how about something like:

<< A "string" is a list of integers where the integers represent characters
    (actually, they are Unicode codepoints that represent characters, but don't
    worry about that right now). >>

Thomas



Thomas Lindgren

Aug 2, 2012, 8:05:41 AM
to erlang-q...@erlang.org


It might well be a wooly-headed idea when you look at the details, and I confess to not being an expert in this area. The basic concept would be to warn when, for instance, you've entered your string literals in Latin-1 when the compiler or system options decree that you should use UTF8. I like the overall idea of not leaving encoding problems to the good will of external tools, but if it can't be detected reliably, then it's of course just dreaming. 

Another approach might be to use a heuristic tool a la xref to detect "suspicious" string literals. Not sure if that helps.
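
One possible shape for such a heuristic, purely as a sketch: flag a string literal whose elements all fit in a byte, include at least one value above 127, and happen to decode cleanly as UTF-8. The module name suspicious and the function looks_like_utf8_bytes/1 are made up here; this is a guess that can be wrong in both directions, not a reliable detector.

```erlang
-module(suspicious).
-export([looks_like_utf8_bytes/1]).

%% Heuristic: does this list of integers look like UTF-8 bytes that
%% were read as Latin-1 codepoints?
looks_like_utf8_bytes(Str) when is_list(Str) ->
    AllBytes = lists:all(fun(C) -> is_integer(C) andalso C >= 0 andalso C =< 255 end, Str),
    HasHigh  = lists:any(fun(C) -> is_integer(C) andalso C >= 128 end, Str),
    %% andalso short-circuits, so list_to_binary/1 only ever sees bytes.
    AllBytes andalso HasHigh andalso
        is_list(unicode:characters_to_list(list_to_binary(Str), utf8)).
```

For example, [195,184] is flagged (it is the UTF-8 form of codepoint 248), while [248], plain ASCII, and [97,8734,98] are not.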

Best,
Thomas

Max Bourinov

Aug 2, 2012, 8:29:29 AM
to Joe Armstrong, Erlang
Perfectly clear for me.


Best regards,
Max





CGS

Aug 2, 2012, 9:46:06 AM
to Joe Armstrong, Erlang
Hi Joe,

Regarding the clarity, you can see from the length of this thread how clear your definition is. :)

Regarding the correctness, your definition is a bit tricky in my opinion (a non-expert opinion, though): it is arguable once you take into account the difference between Unicode code points and Unicode encoding schemes. Even when the UTF-8 encoding scheme is used, for example, Erlang knows nothing about the correlation between the elements of the list, so the sequence can be interpreted as code points in the Latin-1 region even if those code points make no real sense in Latin-1 when replaced with the indexed characters (especially in the region 128 - 255).

For clarity, your famous "a∞b" in Unicode code points is [97,8734,98] (this format may break the code), while in the UTF-8 encoding scheme it reads [97,226,136,158,98]. The Erlang compiler has no idea that the sequence [226,136,158] should be built back to 8734 before passing it back to the environment, so strange symbols may appear if the environment interprets the integers as Unicode code points, which it usually does.

When UTF-8 support becomes available in Erlang, I suppose the string will also be accepted internally as Unicode code points for the range U+0080 - U+10FFFF, but until then the accepted integers represent the disconnected byte sequence of the UTF-8 encoding scheme. It is still the user's job to transform them back into Unicode code points for the environment to display the symbols correctly (e.g., io:format("~ts~n",[[97,8734,98]]) will reproduce the correct string in a UTF-8 environment).
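
As a rough sketch of that transformation in both directions (the module name roundtrip_demo is invented; unicode:characters_to_binary/1, unicode:characters_to_list/1 and io_lib:format/2 are standard stdlib calls):

```erlang
-module(roundtrip_demo).
-export([demo/0]).

demo() ->
    %% "a", INFINITY, "b" as code points.
    CodePoints = [97, 8734, 98],

    %% Code points -> UTF-8 bytes.
    Utf8 = unicode:characters_to_binary(CodePoints),
    <<97, 226, 136, 158, 98>> = Utf8,

    %% UTF-8 bytes -> code points again.
    CodePoints = unicode:characters_to_list(Utf8),

    %% The ~ts directive accepts code points above 255.
    CodePoints = lists:flatten(io_lib:format("~ts", [CodePoints])),
    ok.
```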

This is my 2c opinion (I hope I offended no expert).

CGS

Richard O'Keefe

Aug 2, 2012, 6:19:43 PM
to Thomas Lindgren, erlang-q...@erlang.org

On 3/08/2012, at 12:05 AM, Thomas Lindgren wrote:
>
> Another approach might be to use a heuristic tool a la xref to detect "suspicious" string literals. Not sure if that helps.

Now _that_ is possible and could well be helpful.

For that matter, I invite you to consider this example in pure ASCII:

1> c(snark).
{ok,snark}

where in the shell

m% cat snark.erl
-module(snark).
-export([f/0]).
f() -> [ '', '', '', '', '', '', '', '' ].

but in my text editor,

f() -> [ '^@', '^A', '^B', '^C', '^D', '^E', '^F', '^G' ].

It would probably be a Good Thing if Erlang gave at least a warning
about control characters in atom, string, or character literals
written without using \ .
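
A very small sketch of the kind of check I have in mind. It scans a whole piece of source text (as a list of codepoints) rather than only the insides of literals, which a real warning would have to do; the module name ctrl_scan and its function names are made up for the example.

```erlang
-module(ctrl_scan).
-export([raw_controls/1]).

%% Returns the control characters found in a piece of source text,
%% given as a list of codepoints. Tab, LF and CR are treated as
%% ordinary layout and are not reported.
raw_controls(Source) when is_list(Source) ->
    [C || C <- Source, is_integer(C), is_suspect(C)].

is_suspect(C) when C =:= $\t; C =:= $\n; C =:= $\r -> false;
is_suspect(C) when C >= 0, C =< 31 -> true;      % C0 controls
is_suspect(C) when C >= 127, C =< 159 -> true;   % DEL and C1 controls
is_suspect(_) -> false.
```

For instance, ctrl_scan:raw_controls([$f, 0, $\n, 7, $a]) picks out 0 and 7 and ignores the newline.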

I'd write this up as an EEP if EEPS didn't have to be in Markdown these days.

Steve Davis

Aug 2, 2012, 7:53:44 PM
to erlang-pr...@googlegroups.com, Erlang, erl...@gmail.com
A "string" is an arbitrary representation of some original binary stream that has been stripped of any information about its encoding format - whether this was a direct transformation or not. No wonder nobody can agree.

Jeff Schultz

Aug 2, 2012, 8:15:37 PM
to erlang-q...@erlang.org
On Fri, Aug 03, 2012 at 10:19:43AM +1200, Richard O'Keefe wrote:
> m% cat snark.erl
> -module(snark).
> -export([f/0]).
> f() -> [ '', '', '', '', '', '', '', '' ].
>
> but in my text editor,
>
> f() -> [ '^@', '^A', '^B', '^C', '^D', '^E', '^F', '^G' ].
>
> It would probably be a Good Thing if Erlang gave at least a warning
> about control characters in atom, string, or character literals
> written without using \ .

Could we just disallow control characters altogether? Or at least
require an option to enable them. Please. They're just accidents
waiting to happen when UTF-8 is eventually implemented and serve no
useful purpose that I can see.


Jeff Schultz