[erlang-questions] unicode in string literals

Joe Armstrong

Jul 30, 2012, 8:35:50 AM
to Erlang
What is a literal string in Erlang? Originally it was a list of
integers, each integer
being a single character code - this made strings very easy to work with

The code

test() -> "a∞b".

Compiles to code which returns the list
of integers [97,226,136,158,98].

This is very inconvenient. I had expected it to return
[97, 8734, 98]. The length of the list should be 3 not 5
since it contains three unicode characters not five.

Is this a bug or a horrible misfeature?

So how can I make a string with the three characters 'a' 'infinity' 'b'

test() -> "a\x{221e}b" is ugly

test() -> <<"a∞b"/utf8>> seems to be a bug
it gives an error in the
shell but is ok in compiled code and
returns
<<97,195,162,194,136,194,158,98>> which is
very strange

test() -> [$a,8734,$b] is ugly

/Joe
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Richard Carlsson

Jul 30, 2012, 9:02:13 AM
to erlang-q...@erlang.org
On 07/30/2012 02:35 PM, Joe Armstrong wrote:
> What is a literal string in Erlang? Originally it was a list of
> integers, each integer
> being a single character code - this made strings very easy to work with
>
> The code
>
> test() -> "a∞b".
>
> Compiles to code which returns the list
> of integers [97,226,136,158,98].
>
> This is very inconvenient. I had expected it to return
> [97, 8734, 98]. The length of the list should be 3 not 5
> since it contains three unicode characters not five.
>
> Is this a bug or a horrible misfeature?

You saved your source file as UTF-8, so between the two double-quotes,
the source file contains exactly those bytes. But the Erlang compiler
assumes your source code is Latin-1, so it thinks that you wrote a
Latin-1 string of 5 characters (some of which are non-printing). There's
as yet no support for telling the compiler that the input is anything
else than Latin-1, so you can't save your source files as UTF-8. (One
thing you can do is put the UTF-8 strings in another file and read them
at runtime.)
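
To make the arithmetic concrete, here is a minimal sketch of where the five integers come from. The module and function names are made up for illustration, and the code points are written as integers so that the example does not depend on how the file itself is saved.

```erlang
-module(latin1_misread).
-export([demo/0]).

%% Code points for a, INFINITY (U+221E), b.
demo() ->
    CodePoints = [97, 16#221E, 98],
    %% What an editor saving as UTF-8 writes to disk: five bytes.
    Utf8 = unicode:characters_to_binary(CodePoints, unicode, utf8),
    %% What a reader assuming Latin-1 sees: one character per byte.
    AsLatin1 = binary_to_list(Utf8),
    {length(CodePoints), AsLatin1}.
```

Calling latin1_misread:demo() should give {3,[97,226,136,158,98]}: three characters in, five "characters" out.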

> test() -> <<"a∞b"/utf8>> seems to be a bug

Try <<"åäö"/utf8>>. It works, but like your first example, the source
string is limited to Latin-1. Strings entered in the shell may be
interpreted differently though, depending on your locale settings.
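
The strange eight-byte binary from the original post can be reproduced the same way. A small sketch, assuming the compiler has already misread the literal as five Latin-1 characters (names are illustrative only):

```erlang
-module(double_encode).
-export([demo/0]).

demo() ->
    %% The five "Latin-1 characters" seen between the quotes.
    Misread = [97, 226, 136, 158, 98],
    %% /utf8 then encodes each of them as a code point in its own right,
    %% so every value above 127 becomes two bytes.
    unicode:characters_to_binary(Misread, unicode, utf8).
```

This should return <<97,195,162,194,136,194,158,98>>, i.e. the UTF-8 bytes encoded as UTF-8 a second time.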

/Richard

CGS

Jul 30, 2012, 9:06:38 AM
to Joe Armstrong, Erlang
Hi Joe,

You may try unicode module:

test() -> unicode:characters_to_list("a∞b",utf8).

which will return the desired list [97,8734,98]. As Richard said, the default is Latin-1 (0-255 integers).

As for binaries, the same problem (assuming Latin-1).

CGS

Masklinn

Jul 30, 2012, 9:13:49 AM
to Richard Carlsson, erlang-q...@erlang.org
On 2012-07-30, at 15:02 , Richard Carlsson wrote:

> On 07/30/2012 02:35 PM, Joe Armstrong wrote:
>> What is a literal string in Erlang? Originally it was a list of
>> integers, each integer
>> being a single character code - this made strings very easy to work with
>>
>> The code
>>
>> test() -> "a∞b".
>>
>> Compiles to code which returns the list
>> of integers [97,226,136,158,98].
>>
>> This is very inconvenient. I had expected it to return
>> [97, 8734, 98]. The length of the list should be 3 not 5
>> since it contains three unicode characters not five.
>>
>> Is this a bug or a horrible misfeature?
>
> You saved your source file as UTF-8, so between the two double-quotes, the source file contains exactly those bytes. But the Erlang compiler assumes your source code is Latin-1

I'd expect the string manipulation functions of Erlang to assume that as
well (that strings are lists of "bytes"), don't they? E.g. that `words`
splits on 0x20 (and maybe 0xA0), not on the {{Zs}} general category?

Richard Carlsson

Jul 30, 2012, 9:23:09 AM
to erlang-q...@erlang.org
On 07/30/2012 03:06 PM, CGS wrote:
> Hi Joe,
>
> You may try unicode module:
>
> test() -> unicode:characters_to_list("a∞b",utf8).
>
> which will return the desired list [97,8734,98]. As Richard said, the
> default is Latin-1 (0-255 integers).

No! Don't save a source file as UTF8, at least without a way of marking
up such files as being special. The problem is that if you do the trick
above, you have to ensure that you convert _all_ string literals
explicitly this way (at least if they may contain characters outside
ASCII). But if you have a character such as ö, or é, in a string and you
forget to convert explicitly from UTF8 to single code points, then that
"é" will in fact be 2 bytes, while in another module saved in Latin-1,
the string "é" that looks the same in your editor will be a single byte,
and they won't compare equal. Having modules saved with different
encodings is a recipe for disaster (in particular when it comes to
future maintenance). Erlang currently only supports Latin-1 in source
files; until that is fixed, you should keep your UTF8-data in separate
files.
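
A minimal sketch of that mismatch, with the two "files" simulated by integer lists (the names are hypothetical, and nothing here depends on the encoding of this text):

```erlang
-module(mixed_encodings).
-export([demo/0]).

demo() ->
    %% The literal "é" as it arrives from a module saved in Latin-1.
    FromLatin1File = [233],
    %% The literal "é" as it arrives from a module saved in UTF-8
    %% but read as Latin-1: two separate characters.
    FromUtf8File = binary_to_list(unicode:characters_to_binary([233])),
    {FromUtf8File, FromLatin1File =:= FromUtf8File}.
```

The result should be {[195,169],false}: two strings that look identical in the editor but do not compare equal.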

/Richard

Richard Carlsson

Jul 30, 2012, 9:28:12 AM
to Masklinn, erlang-q...@erlang.org
On 07/30/2012 03:13 PM, Masklinn wrote:
> I'd expect the string manipulation functions of Erlang to assume that as
> well (that strings are lists of "bytes"), don't they? E.g. that `words`
> splits on 0x20 (and maybe 0xA0), not on the {{Zs}} general category?

Yes, the old "string" module in the Erlang stdlib is not much use for
working with Unicode strings. You should use something like the "ux"
library (https://github.com/freeakk/ux) or Erlang bindings to ICU (I can't
seem to find the link, but I think there is more than one
implementation of such bindings).

/Richard

Richard Carlsson

Jul 30, 2012, 9:39:16 AM
to erlang-q...@erlang.org
Since this encoding confusion seems to be regularly occurring on this
list, I might as well post a link to a set of slides I originally made
for our internal training:

http://www.scribd.com/doc/86177907/Encodings-Unicode-and-Erlang-by-Richard-Carlsson

I do apologize for the ugliness; I'm no powerpoint wizard to begin with,
and they seem to have been a bit mangled by the upload to Scribd.

Kirill Zaborsky

Jul 30, 2012, 9:43:37 AM
to Richard Carlsson, erlang-q...@erlang.org
Richard, 
Is it possible to get presentation pdf for free?

Kind regards,
Kirill Zaborsky

Richard Carlsson

Jul 30, 2012, 9:50:42 AM
to Kirill Zaborsky, erlang-q...@erlang.org
On 07/30/2012 03:43 PM, Kirill Zaborsky wrote:
> Richard,
> Is it possible to get presentation pdf for free?

Absolutely: https://dl.dropbox.com/u/985859/Encodings.pdf

CGS

Jul 30, 2012, 9:52:00 AM
to Richard Carlsson, erlang-q...@erlang.org
Valid point. I didn't say the solution should necessarily be used; I just gave a solution that answers the problem raised. As for how to use it, I don't think Joe needs any further instruction (especially from me). :)

CGS

Masklinn

Jul 30, 2012, 10:01:05 AM
to Richard Carlsson, erlang-q...@erlang.org

On 2012-07-30, at 15:39 , Richard Carlsson wrote:

> Since this encoding confusion seems to be regularly occurring on this list, I might as well post a link to a set of slides I originally made for our internal training:
>
> http://www.scribd.com/doc/86177907/Encodings-Unicode-and-Erlang-by-Richard-Carlsson
>
> I do apologize for the ugliness; I'm no powerpoint wizard to begin with, and they seem to have been a bit mangled by the upload to Scribd.

Looks fine to me, although Speakerdeck would probably be lighter on the browser.

Joe Armstrong

Jul 30, 2012, 10:25:54 AM
to CGS, Erlang
On Mon, Jul 30, 2012 at 3:06 PM, CGS <cgsmc...@gmail.com> wrote:
> Hi Joe,
>
> You may try unicode module:
>
> test() -> unicode:characters_to_list("a∞b",utf8).
>
> which will return the desired list [97,8734,98]. As Richard said, the
> default is Latin-1 (0-255 integers).

Very strange I tried that earlier, this is what happens:

$ Eshell V5.9 (abort with ^G)
1> unicode:characters_to_list([97,226,136,158,98], utf8).
[97,226,136,158,98]

The manual says the first argument is a utf8 string

/Joe

Richard Carlsson

Jul 30, 2012, 10:39:03 AM
to Dmitrii Dimandt, erlang-q...@erlang.org
It only works for members (such as yourself, Dmitrii), so I guessed
that's why he used the words "for free".

/Richard

On 07/30/2012 04:20 PM, Dmitrii Dimandt wrote:
> There's a "Download or Print" button on Scribd as well ;) To the right
> of the document
>
>> Richard,
>> Is it possible to get presentation pdf for free?
>>
>> Kind regards,
>> Kirill Zaborsky
>>
>> 2012/7/30 Richard Carlsson <carlsson...@gmail.com>
>>
>> Since this encoding confusion seems to be regularly occurring on
>> this list, I might as well post a link to a set of slides I
>> originally made for our internal training:
>>
>> http://www.scribd.com/doc/86177907/Encodings-Unicode-and-Erlang-by-Richard-Carlsson
>>
>> I do apologize for the ugliness; I'm no powerpoint wizard to begin
>> with, and they seem to have been a bit mangled by the upload to
>> Scribd.
>>
>>
>> /Richard

Joe Armstrong

Jul 30, 2012, 10:42:08 AM
to Richard Carlsson, erlang-q...@erlang.org
On Mon, Jul 30, 2012 at 3:02 PM, Richard Carlsson
<carlsson...@gmail.com> wrote:
> On 07/30/2012 02:35 PM, Joe Armstrong wrote:
>>
>> What is a literal string in Erlang? Originally it was a list of
>> integers, each integer
>> being a single character code - this made strings very easy to work with
>>
>> The code
>>
>> test() -> "a∞b".
>>
>> Compiles to code which returns the list
>> of integers [97,226,136,158,98].
>>
>> This is very inconvenient. I had expected it to return
>> [97, 8734, 98]. The length of the list should be 3 not 5
>> since it contains three unicode characters not five.
>>
>> Is this a bug or a horrible misfeature?
>
>
> You saved your source file as UTF-8, so between the two double-quotes, the
> source file contains exactly those bytes. But the Erlang compiler assumes
> your source code is Latin-1, so it thinks that you wrote a Latin-1 string of
> 5 characters (some of which are non-printing). There's as yet no support for
> telling the compiler that the input is anything else than Latin-1, so you
> can't save your source files as UTF-8. (One thing you can do is put the
> UTF-8 strings in another file and read them at runtime.)

Oh dear - you're right of course.

This means that the only portable and 100% correct way to get 'a'
'INFINITY' 'b' into a string literal
is to say

"a\x{221e}b"

"a∞b" in any form won't work if the compiler is not explicitly
told "this file is utf8"
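
A quick check of the escaped form, as a sketch (module name made up). Since the source text is pure ASCII, the result should not depend on what encoding the compiler assumes for the file:

```erlang
-module(escaped_literal).
-export([demo/0]).

demo() ->
    %% \x{221e} names the code point directly; no raw non-ASCII bytes
    %% ever appear in the source file.
    S = "a\x{221e}b",
    {S =:= [97, 8734, 98], length(S)}.
```

escaped_literal:demo() should return {true,3}.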

Should the pre-processor make a rude noise and only accept latin1
printable characters?

/joe

Richard Carlsson

Jul 30, 2012, 11:07:16 AM
to erlang-q...@erlang.org
On 07/30/2012 04:25 PM, Joe Armstrong wrote:
> Very strange I tried that earlier, this is what happens:
>
> $ Eshell V5.9 (abort with ^G)
> 1> unicode:characters_to_list([97,226,136,158,98], utf8).
> [97,226,136,158,98]
>
> The manual says the first argument is a utf8 string

The unicode:characters_to_list() function has tripped me up more than
once, and the documentation isn't very clear. The key to understanding
it seems to be to look at the possible types for the input:

Data = latin1_chardata() | chardata() | external_chardata()

These are just versions of chardata(), i.e., possibly deep lists with
mixed integers and binaries, and they only differ in how binary segments
should be interpreted. If there are integers in the list, they will
always be interpreted as full Unicode code points, not needing any
conversion. So if your input is a list (or any Latin1-encoded IO-list),
the following should work:

unicode:characters_to_list(iolist_to_binary([97,226,136,158,98]), utf8).
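
The difference between integer and binary input can be seen side by side in a small sketch (module and variable names are illustrative only):

```erlang
-module(chardata_demo).
-export([demo/0]).

demo() ->
    Bytes = [97, 226, 136, 158, 98],
    %% Integers in a list are taken to be code points already: unchanged.
    AsList = unicode:characters_to_list(Bytes, utf8),
    %% The same values inside a binary are decoded as UTF-8.
    AsBinary = unicode:characters_to_list(list_to_binary(Bytes), utf8),
    %% Mixed chardata: only the binary segment is decoded.
    Mixed = unicode:characters_to_list([97, <<226, 136, 158>>, 98], utf8),
    {AsList, AsBinary, Mixed}.
```

This should give {[97,226,136,158,98],[97,8734,98],[97,8734,98]}: the encoding argument only affects how binaries are read.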

/Richard

Richard Carlsson

Jul 30, 2012, 11:17:51 AM
to Joe Armstrong, erlang-q...@erlang.org
On 07/30/2012 04:42 PM, Joe Armstrong wrote:
> Should the pre-processor make a rude noise and only accept latin1
> printable characters?

Maybe. Personally, I've never liked non-printing characters or
whitespace other than plain spaces in strings; if you write a program
that should output e.g. a tab as part of a string, and you use ^I in the
source code instead of \t, how can you stay sure that the editor hasn't
at some point changed it to a space, in particular with many different
people working on the code? Even leading or trailing spaces are prone to
being edited out by mistake. (Erlang supports the \s escape as a synonym
for space, for example $\s, but I don't know any other language that has
this pretty obvious feature.)
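
For illustration, a tiny sketch of the escape in both character and string position (module name made up):

```erlang
-module(space_escape).
-export([demo/0]).

demo() ->
    %% $\s is the space character; "Hello\s" keeps its trailing space
    %% visible in the source, so it is harder to trim by accident.
    {$\s =:= 32, "Hello\s" =:= [72, 101, 108, 108, 111, 32]}.
```

space_escape:demo() should return {true,true}.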

Michael Uvarov

Jul 30, 2012, 1:33:20 PM
to Richard Carlsson, erlang-q...@erlang.org
Hi,

I can write a parse transform for ux, that converts bytes into code
points. Something like this:
ustr("a∞b")

The same idea is used here:
https://github.com/freeakk/i18n#using-unicode-strings-in-source-code
But it converts strings into an ICU format: binary, utf-16. And it is a hack.

Richard O'Keefe

Jul 30, 2012, 6:44:46 PM
to Richard Carlsson, erlang-q...@erlang.org
The thing that puzzles me about Erlang assuming that source files are in
Latin 1 is that I have a tokenizer for Erlang that assumes Latin 1 and
in every Erlang/OTP release I've checked there has been at least one
file it tripped up on because of UTF-8 characters.

When can we expect -encoding('whatever'). to be supported?

Richard O'Keefe

Jul 30, 2012, 6:57:33 PM
to Richard Carlsson, erlang-q...@erlang.org

On 31/07/2012, at 3:17 AM, Richard Carlsson wrote:
> Even leading or trailing spaces are prone to being edited out by mistake. (Erlang supports the \s escape as a synonym for space, for example $\s, but I don't know any other language that has this pretty obvious feature.)

Some Prolog systems do (the tokeniser in my book offered it, IIRC).
One of the people on the Prolog standard mailing list has been jumping
up and down in anger because SWI Prolog has it. But when you have
characters like 0'x, 0'y, 0' , it's really nice to have 0'\s. (Don't
laugh. Erlang's $x, $y, $ , is just as bad.)

I actually have a keystroke in my emacs-ish editor to remove trailing
spaces because their presence breaks things more often than it helps.

Michael Truog

Jul 30, 2012, 7:41:24 PM
to Richard O'Keefe, erlang-q...@erlang.org
On 07/30/2012 03:44 PM, Richard O'Keefe wrote:
> The thing that puzzles me about Erlang assuming that source files are in
> Latin 1 is that I have a tokenizer for Erlang that assumes Latin 1 and
> in every Erlang/OTP release I've checked there has been at least one
> file it tripped up on because of UTF-8 characters.
>
> When can we expect -encoding('whatever'). to be supported?

The solution with the way things are currently, is just to use modelines (within the first 3 lines of the file) which are supported in your favorite editor, vi or emacs:
% -*- coding: utf-8; Mode: erlang; tab-width: 4; c-basic-offset: 4; indent-tabs-mode: nil -*-
% ex: set softtabstop=4 tabstop=4 shiftwidth=4 expandtab fileencoding=utf-8:

You just need to make sure modeline support is turned on (vim seems to default modeline support to off).

Michel Rijnders

Jul 31, 2012, 3:00:16 AM
to erlang-q...@erlang.org
On Tue, Jul 31, 2012 at 1:41 AM, Michael Truog <mjt...@gmail.com> wrote:
> On 07/30/2012 03:44 PM, Richard O'Keefe wrote:
>> The thing that puzzles me about Erlang assuming that source files are in
>> Latin 1 is that I have a tokenizer for Erlang that assumes Latin 1 and
>> in every Erlang/OTP release I've checked there has been at least one
>> file it tripped up on because of UTF-8 characters.
>>
>> When can we expect -encoding('whatever'). to be supported?
>
> The solution with the way things are currently, is just to use modelines (within the first 3 lines of the file) which are supported in your favorite editor, vi or emacs:
> % -*- coding: utf-8; Mode: erlang; tab-width: 4; c-basic-offset: 4; indent-tabs-mode: nil -*-
> % ex: set softtabstop=4 tabstop=4 shiftwidth=4 expandtab fileencoding=utf-8:
>

Shouldn't that modeline read:
% -*- coding: latin-1; mode: erlang; tab-width: 4; c-basic-offset: 4;
indent-tabs-mode: nil -*-

Since the compiler assumes source files are in Latin 1?

> You just need to make sure modeline support is turned on (vim seems to default modeline support to off).

--
My other car is a cdr.

Joe Armstrong

Jul 31, 2012, 3:05:54 AM
to Richard O'Keefe, erlang-q...@erlang.org
Is "encoding(...)" a good idea?

There are four reasonable alternatives

a) - all files are Latin1
b) - all files are UTF8
c) - all files are Latin1 or UTF8 and you guess
d) - all files are Latin1 or UTF8 or anything else and you tell

Today we do a).

What would be the consequences of changing to b) in (say) the next
major release?

This would break some code - but how much? - how much code is there
with non Latin1 printable characters
in string literals? - it should be easy to write a program to test for
this and flag string literals that
might cause problems if the default convention was changed.
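
Such a checker could be sketched roughly as follows. This is only an outline under some assumptions: the module and function names are invented, it uses erl_anno, which is newer than the releases discussed in this thread, and it works on a string that has already been read in (e.g. via binary_to_list/1 on the file contents). It also flags characters written as escapes, which are in fact safe, so it over-reports.

```erlang
-module(flag_strings).
-export([non_ascii_strings/1]).

%% Return {Line, String} for every string literal in Source that
%% contains a character above 127, i.e. one whose meaning would
%% depend on the assumed file encoding if it was typed in raw.
non_ascii_strings(Source) ->
    {ok, Tokens, _End} = erl_scan:string(Source),
    [{erl_anno:line(Anno), Str}
     || {string, Anno, Str} <- Tokens,
        lists:any(fun(C) -> C > 127 end, Str)].
```

For a two-line source whose second line holds a literal with a raw 233 in it, this should report line 2 only.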

/Joe

Michael Truog

Jul 31, 2012, 3:09:46 AM
to Michel Rijnders, erlang-q...@erlang.org
On 07/31/2012 12:00 AM, Michel Rijnders wrote:
> On Tue, Jul 31, 2012 at 1:41 AM, Michael Truog <mjt...@gmail.com> wrote:
>> On 07/30/2012 03:44 PM, Richard O'Keefe wrote:
>>> The thing that puzzles me about Erlang assuming that source files are in
>>> Latin 1 is that I have a tokenizer for Erlang that assumes Latin 1 and
>>> in every Erlang/OTP release I've checked there has been at least one
>>> file it tripped up on because of UTF-8 characters.
>>>
>>> When can we expect -encoding('whatever'). to be supported?
>> The solution with the way things are currently, is just to use modelines (within the first 3 lines of the file) which are supported in your favorite editor, vi or emacs:
>> % -*- coding: utf-8; Mode: erlang; tab-width: 4; c-basic-offset: 4; indent-tabs-mode: nil -*-
>> % ex: set softtabstop=4 tabstop=4 shiftwidth=4 expandtab fileencoding=utf-8:
>>
> Shouldn't that modeline read:
> % -*- coding: latin-1; mode: erlang; tab-width: 4; c-basic-offset: 4;
> indent-tabs-mode: nil -*-
>
> Since the compiler assumes source files are in Latin 1

I think the point was to use utf8 in the source file, thus the utf8 in the modeline. The encoding() would be necessary for various erlang names (like functions, variables, etc.) to be in utf8, but the modeline could help keep list data as utf8.

Richard O'Keefe

Jul 31, 2012, 3:19:47 AM
to Michael Truog, erlang-q...@erlang.org
The snag with mode lines is that they tell the *editor* what to do,
but as they are comments they do not tell the *compiler* one blessed
thing, unless you wire the peculiarities of two editors (Emacs, VIle)
into your compiler (and then what of NetBeans, Eclipse, Visual Studio,
and a horde of other editors).

Michel Rijnders

Jul 31, 2012, 3:32:51 AM
to Michael Truog, erlang-q...@erlang.org
On Tue, Jul 31, 2012 at 9:09 AM, Michael Truog <mjt...@gmail.com> wrote:
>-----8<----------
>>> The solution with the way things are currently, is just to use modelines (within the first 3 lines of the file) which are supported in your favorite editor, vi or emacs:
>>> % -*- coding: utf-8; Mode: erlang; tab-width: 4; c-basic-offset: 4; indent-tabs-mode: nil -*-
>>> % ex: set softtabstop=4 tabstop=4 shiftwidth=4 expandtab fileencoding=utf-8:
>>>
>> Shouldn't that modeline read:
>> % -*- coding: latin-1; mode: erlang; tab-width: 4; c-basic-offset: 4;
>> indent-tabs-mode: nil -*-
>>
>> Since the compiler assumes source files are in Latin 1
>
> I think the point was to use utf8 in the source file, thus the utf8 in the modeline. The encoding() would be necessary for various erlang names (like functions, variables, etc.) to be in utf8, but the modeline could help keep list data as utf8.

IMO this doesn't solve the problem, and only confuses the issue;
consider the following:

test() ->
    io:format("~w~n", ["Just my €0.02"]),
    io:format("~w~n", [lists:reverse("Just my €0.02")]).

> test().
[74,117,115,116,32,109,121,32,226,130,172,48,46,48,50]
[50,48,46,48,172,130,226,32,121,109,32,116,115,117,74]

If the list data was kept as UTF-8 then the output of the second
statement should be:
[50,48,46,48,226,130,172,32,121,109,32,116,115,117,74]

The above of course depends on whether you view strings as lists of
bytes vs lists of characters.

--
My other car is a cdr.

Richard O'Keefe

Jul 31, 2012, 3:33:26 AM
to Joe Armstrong, erlang-q...@erlang.org

On 31/07/2012, at 7:05 PM, Joe Armstrong wrote:

> Is "encoding(...)" a good idea?
>
> There are four reasonable alternatives
>
> a) - all files are Latin1

No good for people who need to write (comments, strings, quoted
atoms) in a language not limited to a Western European script.

> b) - all files are UTF8

No good for people who are perfectly happy with Latin 1 (me!)
and who need the occasional character outside ASCII (like, oh,
some people in Sweden maybe?) But could be tolerable.

> c) - all files are Latin1 or UTF8 and you guess

Guessing is always a bad idea.

> d) - all files are Latin1 or UTF8 or anything else and you tell

It works for XML. :- encoding(...) works for SWI Prolog:

:- encoding(+Encoding)
This directive can appear anywhere in a source file
to define how characters are encoded in the remainder
of the file. It can be used in files that are encoded
with a superset of ASCII, currently UTF-8 and Latin-1.
See also section 2.18.1.

A smart editor like Emacs can be taught to recognise
[:]- ?encoding([']Encoding[']).
at the top of a file just as easily as it can recognise its
own mode-lines.

Vlad Dumitrescu

Jul 31, 2012, 3:36:40 AM
to Joe Armstrong, erlang-q...@erlang.org
Hi,

On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erl...@gmail.com> wrote:
> Is "encoding(...)" a good idea?
>
> There are four reasonable alternatives
> a) - all files are Latin1
> b) - all files are UTF8
> c) - all files are Latin1 or UTF8 and you guess
> d) - all files are Latin1 or UTF8 or anything else and you tell

By the question above, do you mean to imply that '-encoding(...)' will
allow mixed encodings in a project, which is not a reasonable
alternative?

> Today we do a).
> What would be the consequences of changing to b) in (say) the next
> major release?
>
> This would break some code - but how much? - how much code is there
> with non Latin1 printable characters
> in string literals?

I don't think that would be the only problem: there is also all the code
that assumes that source code is latin-1. Also, tools that handle
source code will need to be able to recognize both the old and new
encodings, as they might have to work with an older version of
a file, from before the conversion.

Another question that needs to be answered is what encoding
the source code will use outside strings, quoted atoms and comments: do
we want atoms and variable names to be utf8 too? I ask because I've seen at
least one example of code that uses extended latin-1 characters in
those places.

Also, what should string manipulation functions do by default, should
they assume an encoding? I think the only way to remain sane would be
to have a special string type, tagged with the encoding -- as it is
now, one can use string manipulation functions on lists of arbitrary
integers and list manipulation functions on strings.

Would a syntactic construct like u"some string" that returns a tagged
utf8 string help?

best regards,
Vlad

Richard Carlsson

Jul 31, 2012, 3:39:55 AM
to Richard O'Keefe, erlang-q...@erlang.org
On 07/31/2012 12:57 AM, Richard O'Keefe wrote:
>
> On 31/07/2012, at 3:17 AM, Richard Carlsson wrote:
>> Even leading or trailing spaces are prone to being edited out by
>> mistake. (Erlang supports the \s escape as a synonym for space, for
>> example $\s, but I don't know any other language that has this
>> pretty obvious feature.)
>
> Some Prolog systems do (the tokeniser in my book offered it, IIRC).
> One of the people on the Prolog standard mailing list has been
> jumping up and down in anger because SWI Prolog has it. But when you
> have characters like 0'x, 0'y, 0' , it's really nice to have 0'\s.
> (Don't laugh. Erlang's $x, $y, $ , is just as bad.)

Yes, I think it was some horrible occurrences of $ , in erl_scan that
originally got me to suggest that \s should be added to Erlang, around
the time when Barklund was working on the draft Standard.

> I actually have a keystroke in my emacs-ish editor to remove
> trailing spaces because their presence breaks things more often than
> it helps.

I wasn't terribly clear - what I meant was leading or trailing spaces
within strings, as in "Hello ". It's always slightly worrying that
someone might not see that they just removed something important.

/Richard

Michel Rijnders

Jul 31, 2012, 3:39:58 AM
to Joe Armstrong, erlang-q...@erlang.org
On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erl...@gmail.com> wrote:
> Is "encoding(...)" a good idea?
>
> There are four reasonable alternatives
>
> a) - all files are Latin1
> b) - all files are UTF8
> c) - all files are Latin1 or UTF8 and you guess
> d) - all files are Latin1 or UTF8 or anything else and you tell

I understand it is quite drastic but I would prefer a separate data
type for (unicode) strings.
--
My other car is a cdr.

Ulf Wiger

Jul 31, 2012, 3:40:41 AM
to Joe Armstrong, erlang-q...@erlang.org

The problem is that this is editor-dependent.

The one time I ran into problems with encoding was when editing a
perfectly normal file with Notepad+, which saves in utf8 unless you
dive into the settings and manage to tell it not to.

For people who need to support multiple OTP versions, including
those who maintain Open Source components, it would be a major
headache to have to maintain different file formats for different OTP
releases.

I vote for a compiler option, keeping Latin-1 as the default.

BR,
Ulf W


On 31 Jul 2012, at 09:05, Joe Armstrong wrote:

> Is "encoding(...)" a good idea?
>
> There are four reasonable alternatives
>
> a) - all files are Latin1
> b) - all files are UTF8
> c) - all files are Latin1 or UTF8 and you guess
> d) - all files are Latin1 or UTF8 or anything else and you tell
>
> Today we do a).
>
> What would be the consequences of changing to b) in (say) the next
> major release?
>
> This would break some code - but how much? - how much code is there
> with non Latin1 printable characters
> in string literals? - it should be easy to write a program to test for
> this and flag string literals that
> might cause problems if the default convention was changed.
>
> /Joe

Ulf Wiger, Co-founder & Developer Advocate, Feuerlabs Inc.
http://feuerlabs.com

Masklinn

Jul 31, 2012, 3:41:44 AM
to erlang-questions Questions
On 2012-07-31, at 09:09 , Michael Truog wrote:
>
> I think the point was to use utf8 in the source file, thus the utf8 in the modeline. The encoding() would be necessary for various erlang names (like functions, variables, etc.) to be in utf8, but the modeline could help keep list data as utf8.

For what it's worth, Python solved that particular issue (and
redundancy) by adding limited modeline support in the language's
parser[0].

Basically, when parsing a file it looks for the pattern
`coding=<encoding name>` in a comment in the first and second (to
account for shebangs) lines of the file, and if that line is present it
uses the specified encoding for the file.
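
A rough Erlang analogue of that lookup might look like the sketch below. Everything in it is an assumption made for illustration: the module and function names, the choice of % as the comment character, and the regular expression, which is adapted from the general shape of the PEP 263 pattern rather than copied from any existing tool.

```erlang
-module(coding_comment).
-export([declared_encoding/1]).

%% Look for "coding: Name" or "coding=Name" in a comment on the first
%% or second line of Bin. Returns the name as a binary, or none.
declared_encoding(Bin) when is_binary(Bin) ->
    FirstTwo = lists:sublist(binary:split(Bin, <<"\n">>, [global]), 2),
    first_match(FirstTwo).

first_match([]) ->
    none;
first_match([Line | Rest]) ->
    Pattern = "^[ \\t]*%.*?coding[:=][ \\t]*([-\\w.]+)",
    case re:run(Line, Pattern, [{capture, all_but_first, binary}]) of
        {match, [Name]} -> Name;
        nomatch -> first_match(Rest)
    end.
```

With this, an Emacs-style first line such as "%% -*- coding: utf-8 -*-" should yield <<"utf-8">>, and a file without such a comment should yield none.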

That avoids the requirement of specifying the encoding for the editor
*and* for the compiler.

[0] http://www.python.org/dev/peps/pep-0263/

Masklinn

Jul 31, 2012, 3:53:52 AM
to erlang-questions Questions
On 2012-07-31, at 09:39 , Michel Rijnders wrote:

> On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erl...@gmail.com> wrote:
>> Is "encoding(...)" a good idea?
>>
>> There are four reasonable alternatives
>>
>> a) - all files are Latin1
>> b) - all files are UTF8
>> c) - all files are Latin1 or UTF8 and you guess
>> d) - all files are Latin1 or UTF8 or anything else and you tell
>
> I understand it is quite drastic but I would prefer a separate data
> type for (unicode) strings.

For historical reasons? Because on technical grounds, the existing
scheme would work nicely by declaring that the integers are code points.
And because Unicode is identical to latin-1 in the first 256 codepoints,
latin1 strings would be identical.

The `string` module would probably need to be fixed to be unicode-aware
(or deprecated and removed altogether in favor of the unicode one), but
I'm not sure there are good reasons to change the datatype.[-1]

On the other hand, a dedicated datatype could allow things like Python's
new Flexible String Representation[0] where an explicit "list of code
points" would not allow such flexibility.

The only thing I'd rather avoid is moving from "list of latin-1 bytes" to
"list of utf-8 bytes", that's just crap.

[-1] Actually there's one now that I re-think about it thanks to your
previous mail about lists:reverse: naive list methods will completely
break combining characters or decomposed (NFD and NFKD) strings, even
if strings are encoded as lists of codepoints.
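
A small sketch of the combining-character problem (module name made up). For contrast it also calls string:reverse/1, which works on grapheme clusters; note that this function is an assumption about a newer OTP release than the ones discussed here and was not available when this was written.

```erlang
-module(combining_reverse).
-export([demo/0]).

demo() ->
    %% "a", then "e" followed by U+0301 COMBINING ACUTE ACCENT.
    S = [$a, $e, 16#301],
    %% Naive reversal moves the accent in front of its base letter.
    Naive = lists:reverse(S),
    %% Cluster-aware reversal keeps the "e" and its accent together.
    Clusters = unicode:characters_to_list(string:reverse(S)),
    {Naive, Clusters}.
```

The naive result should be [769,101,97], with the accent orphaned at the front, while the cluster-aware one should be [101,769,97].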

[0] http://www.python.org/dev/peps/pep-0393/ where strings are opaque and
can dynamically change their internal representation between latin-1,
UCS2 and UCS4 to best fit their content (one could even add rope-like
structures so that strings are internally mixed between the
representations if there is cause to)

Richard Carlsson

Jul 31, 2012, 4:02:58 AM
to erlang-q...@erlang.org
On 07/31/2012 09:32 AM, Michel Rijnders wrote:
> IMO this doesn't solve the problem, and only confuses the issue;
> consider the following:
>
> test() ->
>     io:format("~w~n", ["Just my €0.02"]),
>     io:format("~w~n", [lists:reverse("Just my €0.02")]).
>
>> test().
> [74,117,115,116,32,109,121,32,226,130,172,48,46,48,50]
> [50,48,46,48,172,130,226,32,121,109,32,116,115,117,74]

Yes, this is what happens today, because all involved parts (including
the call to io:format with ~w) assume Latin-1 and just pass all the
bytes straight through. Basically, it's your editor and terminal that
are lying by displaying a particular sequence of 3 bytes as € although
the program is really using Latin-1. They conspire against you to make
you think that things are working correctly.

> If the list data was kept as UTF-8 then the output of the second
> statement should be:
> [50,48,46,48,226,130,172,32,121,109,32,116,115,117,74]

That would only be the result if you used a single code point
representation for the input to reverse, and then converted the result
back to a byte encoding (e.g. by printing with ~ts).

> The above of course depends on whether you view strings as lists of
> bytes vs lists of characters.

Strings are lists of characters (code points), so when your example gets
through tokenization, the encoding from the file would already be
forgotten, and you'd have a single integer for the €. (The same goes for
atoms and variable names, by the way, so the answer to Vlad's question
is that these will also get a greater range of available characters.)
String manipulation functions should assume they are working on single
code points, not on a byte encoding.
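
A minimal sketch of the difference, using the existing unicode module.
The module name euro_reverse and its function names are made up for
illustration; the byte values are the ones from the example quoted above.

```erlang
-module(euro_reverse).
-export([utf8_bytes/0, reverse_bytes/0, reverse_codepoints/0]).

%% "€0.02" as the UTF-8 bytes that a Latin-1 compiler passes through.
utf8_bytes() -> [226,130,172,48,46,48,50].

%% Reversing the bytes scrambles the three-byte sequence for the euro sign.
reverse_bytes() -> lists:reverse(utf8_bytes()).

%% Decode to code points first, reverse, then encode back to UTF-8 bytes.
reverse_codepoints() ->
    Chars = unicode:characters_to_list(list_to_binary(utf8_bytes()), utf8),
    Reversed = lists:reverse(Chars),
    binary_to_list(unicode:characters_to_binary(Reversed, unicode, utf8)).
```

Here reverse_bytes() ends in 172,130,226, while reverse_codepoints()
ends in 226,130,172, which is the tail of the output Michel expected.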

/Richard

Loïc Hoguin

Jul 31, 2012, 4:10:14 AM7/31/12
to Masklinn, erlang-questions Questions
On 07/31/2012 09:53 AM, Masklinn wrote:
> On 2012-07-31, at 09:39 , Michel Rijnders wrote:
>
>> On Tue, Jul 31, 2012 at 9:05 AM, Joe Armstrong <erl...@gmail.com> wrote:
>>> Is "encoding(...)" a good idea?
>>>
>>> There are four reasonable alternatives
>>>
>>> a) - all files are Latin1
>>> b) - all files are UTF8
>>> c) - all files are Latin1 or UTF8 and you guess
>>> d) - all files are Latin1 or UTF8 or anything else and you tell
>>
>> I understand it is quite drastic but I would prefer a separate data
>> type for (unicode) strings.
>
> For historical reasons? Because on technical grounds, the existing
> scheme would work nicely by declaring that the integers are code points.
> And because Unicode is identical to latin-1 in the first 256 codepoints,
> latin1 strings would be identical.
>
> The `string` module would probably need to be fixed to be unicode-aware
> (or deprecated and removed altogether in favor of the unicode one), but
> I'm not sure there are good reasons to change the datatype.[-1]
>
> On the other hand, a dedicated datatype could allow things like Python's
> new Flexible String Representation[0] where an explicit "list of code
> points" would not allow such flexibility.
>
> The only thing I'd rather avoid is moving from "list of latin-1 bytes" to
> "list of utf-8 bytes", that's just crap.

If strings are kept as lists:

- there is no way to identify a variable as being a list or latin1
string or utf8 string
- you would have to keep track of what encoding your list is in
- you would have to do some type conversion when you use them with
functions like gen_tcp:send, which don't accept lists of integers > 255

If strings are a new type:

- you don't care about the encoding most of the time, Erlang is the one
who should; if you want to know the encoding you could use a new BIF
encoding(String)
- you don't need to do type conversion when using it, Erlang can use the
string type directly
- you can convert encoding without caring about what the previous
encoding was, for example str:convert(Str, utf8); if it was utf8 it
doesn't change a thing, if it wasn't it's converted
- you can export it as a list or binary in the encoding you want, for
example str:to_binary(Str, utf8)
- you still need to specify the encoding when converting a list or
binary to string, but maybe we could have niceties like <<
Str/string-utf8 >>?

--
Loïc Hoguin
Erlang Cowboy
Nine Nines
http://ninenines.eu

Masklinn

Jul 31, 2012, 4:40:59 AM7/31/12
to erlang-questions Questions
None applies, strings would be lists of codepoints, the original
encoding has been long forgotten at that point and is utterly
irrelevant.

> - you would have to do some type conversion when you use them with functions like gen_tcp:send, which don't accept lists of integers > 255

You would have to encode the list to whatever is expected by the other
side, on input strings.
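
As a rough sketch of that encoding step (the module send_text and its
functions are hypothetical; Socket is assumed to be an already
connected gen_tcp socket, and UTF-8 is assumed to be what the peer
expects):

```erlang
-module(send_text).
-export([encode/1, send/2]).

%% Turn a list of code points into UTF-8 bytes.
encode(Text) -> unicode:characters_to_binary(Text, unicode, utf8).

%% Encode at the boundary, just before the bytes leave the node.
send(Socket, Text) -> gen_tcp:send(Socket, encode(Text)).
```

With this, encode([97,8734,98]) gives <<97,226,136,158,98>>, and the
string itself never carries an encoding.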

> If strings are a new type:
>
> - you don't care about the encoding most of the time, Erlang is the one who should; if you want to know the encoding you could use a new BIF encoding(String)
> - you can convert encoding without caring about what the previous encoding was, for example str:convert(Str, utf8); if it was utf8 it doesn't change a thing, if it wasn't it's converted
> - you can export it as a list or binary in the encoding you want, for example str:to_binary(Str, utf8)

See above, none of these makes sense unless you assume that the
string-list is a list of bytes in a specific encoding which does not
make sense either, in the first place.

> - you don't need to do type conversion when using it, Erlang can use the string type directly

How can't Erlang use the string-list type directly? That's what it
currently does. There's no conversion.

CGS

Jul 31, 2012, 5:36:03 AM7/31/12
to erlang-questions Questions
There are many pros and cons for switching from Latin-1 to UTF-8 (or whatever else which will nullify pretty much the understanding of byte character). On one hand, lists:reverse/1 really messes up the characters in the list (to follow the first example, the output of "a∞b" in Latin-1 is totally different from the output of lists:reverse("b∞a") in Latin-1 - the default now). On the other hand, having, for example, Polish characters like "Ą Ę Ć" or French "Ç Î" or German "Ö ß" or Turkish "Ş" and so on (things become more complicated if we add languages based on different alphabet/symbols) in the code would require your editor to have support for those languages or else you will see really strange characters there. I do not deny some specific projects would benefit from such a character encoding, but think of maintaining such a code in an international environment.

"-encoding()" can make quite a mess in a file. Think of an open source project in which devs from different countries append their own code. You will see a lot of "-encoding()" directives in a single file.

I might be wrong, but, switching to default UTF-8, wouldn't that force the compiler to use 2-byte (at least) per character? If so, for example, what about the databases based on Erlang for projects using strict Latin-1?

My point here is that the string manipulation should be kept apart from the code itself and to have two modules for manipulating normal lists and IO-lists (e.g., by extending unicode module). But that would be my own preference.

CGS

Masklinn

Jul 31, 2012, 5:48:06 AM7/31/12
to erlang-questions Questions
On 2012-07-31, at 11:36 , CGS wrote:
>
> I might be wrong, but, switching to default UTF-8, wouldn't that force the
> compiler to use 2-byte (at least) per character?

No? The first 128 code points (ASCII) fit in a single byte.

> If so, for example, what
> about the databases based on Erlang for projects using strict Latin-1?

The ASCII (7-bit) characters would be stored on 1 byte, those beyond
that (until the codepoint 2048) would be on 2 bytes.
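
A small sketch that lets one check these sizes; the module name
utf8_width is made up, the unicode call is the standard one:

```erlang
-module(utf8_width).
-export([bytes/1]).

%% Number of bytes UTF-8 needs to encode a single code point.
bytes(CodePoint) ->
    byte_size(unicode:characters_to_binary([CodePoint], unicode, utf8)).
```

For instance bytes($a) is 1, bytes(233) (é, a Latin-1 character) is 2,
and bytes(8734) (the infinity sign) is 3.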

CGS

Jul 31, 2012, 8:14:18 AM7/31/12
to Vlad Dumitrescu, Erlang-Questions
On Tue, Jul 31, 2012 at 11:45 AM, Vlad Dumitrescu <vlad...@gmail.com> wrote:
On Tue, Jul 31, 2012 at 11:36 AM, CGS <cgsmc...@gmail.com> wrote:
> There are many pros and cons for switching from Latin-1 to UTF-8 (or
> whatever else which will nullify pretty much the understanding of byte
> character). ...snip... I do

> not deny some specific projects would benefit from such a character
> encoding, but think of maintaining such a code in an international
> environment.

Also, think about having to debug a system from a remote console that
doesn't support the right encoding (that's probably far-fetched in
this day and age, but possible).


> "-encoding()" can make quite a mess in a file. Think of an open source
> project in which devs from different countries append their own code. You
> will see a lot of "-encoding()" directives in a single file.

My understanding was that there should be one and only one such
directive, at the beginning of the file. I'm not even sure if there
are any editors that can handle files with mixed encodings...


> My point here is that the string manipulation should be kept apart from the
> code itself and to have two modules for manipulating normal lists and
> IO-lists (e.g., by extending unicode module). But that would be my own
> preference.

Yes, but what do you do about string literals? They are in the code...

For debugging, for example, I would prefer string literals to stay in English in the source code and be translated through an i18n library, instead of getting strange characters in the log and not understanding the messages in the source code. So Latin-1 should be enough for that.
 

regards,
Vlad

CGS

Jul 31, 2012, 8:25:35 AM7/31/12
to Masklinn, erlang-questions Questions
Correct. My bad.

Still, a question remains: how does the compiler make any difference in between a list of integers and a string coded in UTF-8? For example, consider the following case: a list of indexes vs. a string containing special characters in UTF-8. If you apply lists:reverse/1 in UTF-8, you get undesired list for the reversed list of indexes and, vice-versa, if you apply lists:reverse/1 in Latin-1 you get an undesired reversed list for your string. And I don't suppose "-encoding()" would solve this problem either. By dividing the problem in two types of list manipulation, one can easily decide where to apply what.

CGS

Masklinn

Jul 31, 2012, 8:48:56 AM7/31/12
to erlang-questions Questions
On 2012-07-31, at 14:25 , CGS wrote:

> Still, a question remains: how does the compiler make any difference in
> between a list of integers and a string coded in UTF-8?

It does not, just as it currently does not make a difference. The
distinction is currently informal and based on the usage context of
the list.

> For example,
> consider the following case: a list of indexes vs. a string containing
> special characters in UTF-8. If you apply lists:reverse/1 in UTF-8, you get
> undesired list for the reversed list of indexes and, vice-versa

I touched upon this issue previously. See the first footnote to the message
of id DD8AE349-CF34-42DE...@masklinn.net

> if you
> apply lists:reverse/1 in Latin-1 you get an undesired reversed list for
> your string.

Did you mis-write this? If you wanted to reverse your latin-1 string, this
does reverse the string correctly. It works in neither UTF-8 nor Unicode
contexts though.

> And I don't suppose "-encoding()" would solve this problem
> either.

Correct.

Eric Moritz

Jul 31, 2012, 3:03:08 PM7/31/12
to Joe Armstrong, erlang-q...@erlang.org
Source code in Latin-1 that used non-ASCII (>=128) bytes would be
invalid UTF-8.
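
This is what would make "guessing" (alternative c) workable at all. A
hedged sketch, with a made-up module name guess_encoding: try UTF-8
first and fall back to Latin-1 when the bytes do not decode.

```erlang
-module(guess_encoding).
-export([valid_utf8/1, decode/1]).

%% unicode:characters_to_list/2 returns a list on success and an
%% {error, ...} or {incomplete, ...} tuple on malformed input.
valid_utf8(Bytes) when is_binary(Bytes) ->
    is_list(unicode:characters_to_list(Bytes, utf8)).

%% Guess: UTF-8 if the bytes are well formed, Latin-1 otherwise.
decode(Bytes) ->
    case valid_utf8(Bytes) of
        true  -> unicode:characters_to_list(Bytes, utf8);
        false -> unicode:characters_to_list(Bytes, latin1)
    end.
```

The guess is not foolproof: the Latin-1 bytes 195,169 ("Ã©") happen
to be well-formed UTF-8 for "é", so such a file would be misread.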

Richard O'Keefe

Jul 31, 2012, 9:56:49 PM7/31/12
to Vlad Dumitrescu, erlang-q...@erlang.org

On 31/07/2012, at 7:36 PM, Vlad Dumitrescu wrote:
> By the question above, do you mean to imply that '-encoding(...)' will
> allow mixed encodings in a project, which is not a reasonable
> alternative?

It's not clear to me what you mean by a 'project',
but why should a module written by someone who wants
comments in Māori (note the macron? Latin-4 or Unicode needed)
use a module written by someone who wants comments in Swedish?

It's no worse (and no better!) than having a 'project' where
some of the files assume tabs are set every 8 characters and
some of them assume tabs are set every 4 characters. It's a
thing you need written down; it's a thing your tools need to
understand; and it's a situation that doesn't need to persist
with sources that are under your control.

> I don't think that would be the single problem, but also all the code
> that assumes that source code is latin-1. Also, tools that handle
> source code will need to be able to recognize both the old and new
> encodings, as they might need to have to work with an older version of
> a file, before the conversion.

The whole point of an -encoding directive is that it is something
that syntaxtools should handle; by the time your code gets an AST
or a token list, encodings are entirely a thing of the past.

Gambit Scheme allows different files in a program to use different
encodings. It's no big deal: _only_ the code that converts between
a stream of bytes and a stream of characters knows anything about
encodings; internally it's all Unicode.
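
A sketch of what that one piece of code could look like in Erlang. The
-encoding(Name). directive is the hypothetical syntax discussed in this
thread, not an existing feature, and the module src_decode is made up;
only re:run/3 and unicode:characters_to_list/2 are real library calls.

```erlang
-module(src_decode).
-export([declared_encoding/1, to_codepoints/1]).

%% Look for a hypothetical -encoding(Name). directive at the very start
%% of the file. Anything else, including unknown names, is treated as
%% Latin-1 in this sketch.
declared_encoding(Bytes) ->
    RE = "^\\s*-\\s*encoding\\(\\s*'?(\\w+)'?\\s*\\)",
    case re:run(Bytes, RE, [{capture, all_but_first, list}]) of
        {match, ["utf8"]}  -> utf8;
        {match, ["utf_8"]} -> utf8;
        _                  -> latin1
    end.

%% The only place that knows about encodings: bytes in, code points out.
to_codepoints(Bytes) ->
    unicode:characters_to_list(Bytes, declared_encoding(Bytes)).
```

Everything downstream of to_codepoints/1 sees plain lists of code
points, whichever encoding the file was saved in.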

I haven't done this yet for my Smalltalk compiler because there
are other more urgent issues (like working around C compilers that
are trying to be helpful but fail), but the design work is done and
it should leave the tokeniser running at about the same speed as
the old Latin-1-only tokeniser.

There *will* be a period when I want to keep my old Latin-1 files
(don't fix what isn't broken) but want to start using Unicode in
new work.

SWI Prolog actually lets you change the encoding within a file,
which sounds crazy but maybe Jan wanted the machinery to be there
in case someone wanted ISO 2022 support. (Because that's basically
what 2022 *is*: switching encoding aspects on the fly.)
Why should a Japanese programmer be forbidden to write in her own
script just because some of the source files that get loaded at
run time are encoded in Latin 1?

>
> Another question that needs to be answered is also what encoding will
> the source code use outside strings and quoted atoms and comments

"Encoding" is a whole-file property. If the comments are encoded in
ISO 8859-5 (ISO Cyrillic), so are the strings, and if the strings are
encoded in ISO 8859-5, so are the atoms, both quoted and unquoted.
Encoding logically concerns the interface between the tokeniser and
the external byte stream (in the Unisys ClearPath MCP systems
translation between encodings is done by the operating system before
the data become available to the program). Once the changeover has
been made, the tokeniser should think that *all* characters are
Unicode characters.

> : do
> we want atoms and variable names to be utf8 too? Because I've seen at
> least an example of code that uses extended latin-1 characters in
> those places.

That's not a problem. If a file is encoded in ISO Latin 1, then certain
Unicode characters are encoded a certain way, BUT once into the tokeniser,
nobody knows or cares what that was. If another file is encoded in UTF-8,
then certain Unicode characters are encoded in a different way, BUT once
into the tokeniser, nobody knows or cares what that was.

Encode "(a×2)÷4 = ½a" as 28,61,d7,32,29,f7,34,20,3d,20,bd,61 (Latin-1)
or as 28,61,c3,97,32,29,c3,b7,34,20,3d,20,c2,bd,61 (UTF-8),
and as long as the tokeniser knows what it's getting, it should make
*no* difference to what you get, namely the list
[40,97,215,50,41,247,52,32,61,32,189,97] of integers one per Unicode
code-point. That's how it works in SWI Prolog.
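
The same thing can be checked in Erlang with the unicode module. A
small sketch (the module name same_chars is made up; × is byte 16#D7 in
Latin-1 and 16#C3,16#97 in UTF-8):

```erlang
-module(same_chars).
-export([latin1_bytes/0, utf8_bytes/0, decoded/0]).

%% "(a×2)÷4 = ½a" as Latin-1 bytes.
latin1_bytes() ->
    <<16#28,16#61,16#D7,16#32,16#29,16#F7,16#34,16#20,16#3D,16#20,
      16#BD,16#61>>.

%% The same text as UTF-8 bytes.
utf8_bytes() ->
    <<16#28,16#61,16#C3,16#97,16#32,16#29,16#C3,16#B7,16#34,16#20,
      16#3D,16#20,16#C2,16#BD,16#61>>.

%% Decode each byte sequence with its own encoding.
decoded() ->
    {unicode:characters_to_list(latin1_bytes(), latin1),
     unicode:characters_to_list(utf8_bytes(), utf8)}.
```

Both elements of decoded() are the same list of twelve code points.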

> Also, what should string manipulation functions do by default, should
> they assume an encoding?

No. That would make life insanely complicated. (Well, let's face it,
Unicode is already barking mad; this would make it *rabid* barking mad.)

> I think the only way to remain sane would be
> to have a special string type, tagged with the encoding

No, that's a way to go completely crazy.

The simple way is to distinguish between an inside and an outside.
INSIDE, everything is just Unicode. OUTSIDE is where the wild
things are. Encodings are *ONLY* relevant when you switch
between text encoded as byte sequences and text represented as
Unicode code point sequences.

I mean, can you *imagine* the complexity if "0" =:= "0" fails
because the first is tagged as Latin-1 and the second is tagged
as UTF-8?

How Unicode code-point sequences are represented inside the
machine-level representation of an Erlang atom, Erlang source code
should have no reason whatever to care. They could be UTF8; they
could be UTF16; they could be SCSU; they could be BOCU; they could
be something else entirely.

Converting between strings and binaries is the one place where Erlang
source code should have any reason to care, and it does have a reason
to care. But you will perceive that it is the *binary* that needs to
be associated with an encoding, not the *string*.
of the system


>
> Would a syntactic construct like u"some string" that returns a tagged
> utf8 string help?

No. However, <<"some string"/utf8>> *would* make sense.
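
A sketch of "the binary has an encoding, the string does not". The
module name bin_encoding is made up; the literal uses an \x{...} escape
so that it does not depend on how the source file itself is encoded.

```erlang
-module(bin_encoding).
-export([text/0, literal/0, as_utf8/0, as_utf16/0, round_trip/0]).

%% The string: three code points, no encoding attached.
text() -> [$a, 16#221E, $b].

%% A binary literal whose characters are stored as UTF-8.
literal() -> <<"a\x{221E}b"/utf8>>.

%% The same string exported as two differently encoded binaries.
as_utf8()  -> unicode:characters_to_binary(text(), unicode, utf8).
as_utf16() -> unicode:characters_to_binary(text(), unicode, utf16).

%% Decoding either binary gives back the same list.
round_trip() ->
    text() =:= unicode:characters_to_list(as_utf8(), utf8) andalso
    text() =:= unicode:characters_to_list(as_utf16(), utf16).
```

as_utf8() is five bytes and as_utf16() is six, yet both decode to the
same three-element list, so "0" =:= "0" never depends on a tag.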

Richard O'Keefe

Aug 1, 2012, 12:14:00 AM8/1/12
to CGS, erlang-questions Questions

On 31/07/2012, at 9:36 PM, CGS wrote:

> There are many pros and cons for switching from Latin-1 to UTF-8 (or whatever else which will nullify pretty much the understanding of byte character). On one hand, lists:reverse/1 really messes up the characters in the list

Yes, and that's not all it messes up by any means.

- If you have a sequence of lines represented as a string with network
line terminators (CR+LF) then the reversal of that list is NOT a
sequence of lines with network line terminators (applies to ASCII)

- If you use Unicode language tags, then the reversal of a language
tag is a language tag for a different language and applies to the
wrong characters

- The reversal of a Unicode string including variant selectors (or
other character shaping codes like ZWNJ or ZWJ) is a Unicode
string including variant selectors &c applied to the wrong characters

- The reversal of a Unicode string including a directional command
and a POP DIRECTIONAL FORMATTING code is a string in which there
is a POP before anything has been pushed.

...

So simply forming code points into [base,diacritical...] packets,
reversing the packets, and then flattening *still* isn't nearly
enough to make sense of a reversed string. Indeed, I am not sure
that there *is* any way to make sense of the notion of reversing
a Unicode string.

So I do not take 'lists:reverse/1 will not reverse a Unicodepoint
string correctly' as a criticism of representing strings as lists
of Unicodepoints. NOTHING will. I don't think there is any such
thing as "correctly" reversing such a string.
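
To make the [base,diacritical...] packet idea concrete, here is a
sketch with a made-up module combining_reverse. It assumes an OTP
release that has string:reverse/1 (OTP 20 or later), which reverses by
grapheme cluster. As argued above, this only keeps combining sequences
together; it does nothing for line terminators, language tags or
directional formatting.

```erlang
-module(combining_reverse).
-export([naive/1, by_grapheme/1]).

%% Plain list reversal: a combining mark ends up after a different base.
naive(Chars) -> lists:reverse(Chars).

%% Reverse grapheme clusters, then flatten the clusters back into a
%% flat list of code points.
by_grapheme(Chars) ->
    unicode:characters_to_list(string:reverse(Chars)).
```

For [$e, 16#301, $a] ("e" with a combining acute, then "a"), naive/1
gives [97,769,101], moving the accent onto the "a", while by_grapheme/1
gives [97,101,769].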

There are other operations you can easily do with a list that
don't make sense for Unicode strings either. Take just one
example: splitting a string at an arbitrary position. That can
separate a directional override from its pop. And having a
distinct data type is no protection against that problem: Java
and Javascript both have opaque string datatypes, but both
allow slicing a well formed string into pieces that are not
well formed.



> (to follow the first example, the output of "a∞b" in Latin-1 is totally different from the output of lists:reverse("b∞a") in Latin-1 - the default now). On the other hand, having, for example, Polish characters like "Ą Ę Ć" or French "Ç Î" or German "Ö ß" or Turkish "Ş" and so on (things become more complicated if we add languages based on different alphabet/symbols) in the code would require your editor to have support for those languages or else you will see really strange characters there.

Well, yes. But now you are asking whether the editor supports Unicode.
There are now plenty of editors that do. Right now I am composing mail
in an unbelievably crude text editor (the Mail program on Mac OS X) and
it displays these characters just fine.


>
> "-encoding()" can make quite a mess in a file. Think of an open source project in which devs from different countries append their own code. You will see a lot of "-encoding()" directives in a single file.

Nobody is suggesting that there should be an -encoding directive anywhere
but the first line of a file (or possibly the second). In fact it is
precisely the existence of -encoding directives that would make it possible
to *avoid* the mess you are describing.

Here's what you do.

(1) Write a tiny little program. Here is a first draft.

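#!/usr/bin/awk -f
# Usage: epaste.awk file1.erl... >pasted.erl
# Purpose: paste files in various encodings giving one file in UTF-8.

BEGIN {
    print "-encoding(utf_8)."
    fflush()    # emit the header before any iconv output
    for (i = 1; i < ARGC; i++) {
        input = ARGV[i]
        # Skip empty or unreadable files.
        if ((getline x < input) <= 0) { close(input); continue }
        if (x ~ /^[ \t]*-[ \t]*encoding\([ \t']*[a-zA-Z0-9_]*[ \t']*\)/) {
            # The first line names the encoding: extract the name, put it
            # in the upper-case, dash-separated form, and drop the line.
            sub(/^[ \t]*-[ \t]*encoding\([ \t']*/, "", x)
            sub(/[ \t']*\).*$/, "", x)
            x = toupper(x)
            gsub(/_/, "-", x)
            cmd = "iconv -f " x " -t UTF-8"
        } else {
            # No directive: assume Latin-1 and keep the first line.
            cmd = "iconv -f ISO-8859-1 -t UTF-8"
            print x | cmd
        }
        # Pipe the rest of the file through iconv ("|", not ">").
        while ((getline x < input) > 0) print x | cmd
        close(cmd)
        close(input)
    }
}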

(2) Instead of pasting together several files by doing
cat foo.erl ugh.erl bar.erl >fub.erl
just do
epaste.awk foo.erl ugh.erl bar.erl >fub.erl

What makes this *possible* is the existence of the -encoding lines.
Without it you are FUBAR.

> I might be wrong, but, switching to default UTF-8, wouldn't that force the compiler to use 2-byte (at least) per character?

Yes, you are wrong. Unicode is a 21-bit character set.
There are currently (6.1) more than 100,000 defined
characters, so 2 bytes is definitely not enough.

But UTF-8 is an *external* format.
What the compiler uses is entirely up to itself.
What the run-time system uses is something different again.
Atom names, for example, could be stored in some compressed format.

> If so, for example, what about the databases based on Erlang for projects using strict Latin-1?

What about them? Do not make the mistake of confusing a
particular set of characters with a way of encoding them.

Masklinn

Aug 1, 2012, 2:57:43 AM8/1/12
to erlang-questions Questions
On 2012-08-01, at 06:14 , Richard O'Keefe wrote:

> And having a
> distinct data type is no protection against that problem: Java
> and Javascript both have opaque string datatypes, but both
> allow slicing a well formed string into pieces that are not
> well formed.

To be fair, they've got the further compounding issue that string types
are dedicated but not opaque: they are sequences of UTF-16 code units
(on account of originally being UCS2 sequences).

As a result, not only do you have the usual Unicode issues, which may or
may not be (non-trivially) solvable (with grapheme-aware unicode handling[0]),
but these are further compounded by the ability to see and break apart
surrogate pairs (so you can e.g. split a string in the middle of a
surrogate pair).

CPython 3.3 has implemented a fully opaque string type, it exposes unicode
codepoints (if I remember correctly) but that may or may not be the
underlying binary data (the underlying representation can dynamically switch
between latin-1, UCS2 and UCS4)

[0] Which also needs to be locale-aware, for instance a conversion to
lower/upper case is not a 1:1 mapping in unicode as different cultures
may have different uppercases for the same lower and the other way
around, the usual example being Turkish in which "I"'s lowercase is "ı"
and the uppercase of "i" is "İ".

Vlad Dumitrescu

Aug 1, 2012, 3:30:19 AM8/1/12
to Richard O'Keefe, erlang-q...@erlang.org
Hi Richard,

First, thanks for the detailed explanation. I see I am still confusing
some of the issues.

On Wed, Aug 1, 2012 at 3:56 AM, Richard O'Keefe <o...@cs.otago.ac.nz> wrote:
> On 31/07/2012, at 7:36 PM, Vlad Dumitrescu wrote:
> It's not clear to me what you mean by a 'project',

I mean a set of related code, some of it possibly third-party.

> but why should a module written by someone who wants
> comments in Māori (note the macron? Latin-4 or Unicode needed)
> use a module written by someone who wants comments in Swedish?

Maybe not in the long run, but there will be a (long) transition
period where legacy code will still be used by new code.

> The whole point of an -encoding directive is that it is something
> that syntaxtools should handle; by the time your code gets an AST
> or a token list, encodings are entirely a thing of the past.

Yes, but I am one of the guys that is going to write some of the tools
that will handle this conversion, so I do care about the details.

> SWI Prolog actually lets you change the encoding within a file,
> which sounds crazy but maybe Jan wanted the machinery to be there
> in case someone wanted ISO 2022 support. (Because that's basically
> what 2022 *is*: switching encoding aspects on the fly.)

Are there any editors that can load/save a file with mixed encodings like that?

<...snip...>


> Converting between strings and binaries is the one place where Erlang
> source code should have any reason to care, and it does have a reason
> to care. But you will perceive that it is the *binary* that needs to
> be associated with an encoding, not the *string*.
> of the system

Right. Good explanation!

I am still a little worried about two things:
- debugging a remote system that has different locale
- reading logs created by modules that have different encodings (some
modules might be legacy and not be aware that the world is not Latin-1
anymore).

regards,
Vlad

Richard O'Keefe

Aug 1, 2012, 9:42:43 PM8/1/12
to Masklinn, erlang-questions Questions

On 1/08/2012, at 6:57 PM, Masklinn wrote:

> On 2012-08-01, at 06:14 , Richard O'Keefe wrote:
>
>> And having a
>> distinct data type is no protection against that problem: Java
>> and Javascript both have opaque string datatypes, but both
>> allow slicing a well formed string into pieces that are not
>> well formed.
>
> To be fair, they've got the further compounding issue that string types
> are dedicated but not opaque: they are sequences of UTF-16 code units
> (on account of originally being UCS2 sequences).

You are right. I should not have said "opaque". The implementation
is *encapsulated*, but the fact that it's a slice of an array of
16-bit units shows through.

As it happens, I *wasn't* referring to the possibility of splitting
a codepoint between two surrogates. If we restrict our attention to
the Basic Multilingual Plane, it is *still* possible to slice a
well formed BMP string into pieces that are not well formed. I have
in mind things like the way Apple used to have two plus signs, one
for left to right text and one for right to left text, but since
Unicode has only one, the way to encode א+ב was
[Aleph, left-to-right override, plus, pop directional formatting, Beth],
and a division that gives the left part either 2 or 3 codepoints is one
that gives you two strings that make no sense.
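
A sketch of that example as a list of code points, with a hypothetical
checker (module bidi_check) that only looks at the embedding and
override controls: U+202A, U+202B, U+202D and U+202E push, U+202C
(POP DIRECTIONAL FORMATTING) pops.

```erlang
-module(bidi_check).
-export([balanced/1, example/0]).

-define(PDF, 16#202C).

%% LRE, RLE, LRO and RLO each open a directional run.
is_push(C) -> lists:member(C, [16#202A, 16#202B, 16#202D, 16#202E]).

%% True if every pop has a matching earlier push and nothing is left open.
balanced(Chars) -> balanced(Chars, 0).

balanced([], Depth) -> Depth =:= 0;
balanced([?PDF | _], 0) -> false;
balanced([?PDF | T], Depth) -> balanced(T, Depth - 1);
balanced([C | T], Depth) ->
    case is_push(C) of
        true  -> balanced(T, Depth + 1);
        false -> balanced(T, Depth)
    end.

%% [Aleph, left-to-right override, plus, pop directional formatting, Beth]
example() -> [16#05D0, 16#202D, $+, 16#202C, 16#05D1].
```

balanced(example()) is true, but after lists:split(2, example()) or
lists:split(3, example()) neither half is balanced, although every
element is an ordinary BMP code point.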

As it happens, I don't know any programming language that deals with
this. My basic point is that any data structure for text that
*doesn't* ensure that all the 'strings' you deal with are well formed
has already lost its virginity and might as well be frankly and openly
just a sequence of code points.

Richard O'Keefe

Aug 1, 2012, 9:50:40 PM8/1/12
to Vlad Dumitrescu, erlang-q...@erlang.org

On 1/08/2012, at 7:30 PM, Vlad Dumitrescu wrote:

>
>> but why should a module written by someone who wants
>> comments in Māori (note the macron? Latin-4 or Unicode needed)
>> use a module written by someone who wants comments in Swedish?
>
> Maybe not in the long run, but there will be a (long) transition
> period where legacy code will still be used by new code.

Sorry, my typing mistake here.
What I *meant* to write was "why should a [Māori] module
*NOT* use a [Swedish] one"? You were saying, or so I thought,
that there should be one project = one encoding, and I was saying
I thought that was too restrictive in practice.


>
>> The whole point of an -encoding directive is that it is something
>> that syntaxtools should handle; by the time your code gets an AST
>> or a token list, encodings are entirely a thing of the past.
>
> Yes, but I am one of the guys that is going to write some of the tools
> that will handle this conversion, so I do care about the details.

And by the time it gets to you, there won't *be* any details to care about.


>
>> SWI Prolog actually lets you change the encoding within a file,
>> which sounds crazy but maybe Jan wanted the machinery to be there
>> in case someone wanted ISO 2022 support. (Because that's basically
>> what 2022 *is*: switching encoding aspects on the fly.)
>
> Are there any editors that can load/save a file with mixed encodings like that?

I have no idea. There are a number of editors that claim to support
ISO 2022, which does mid-stream code switching, so they could presumably
be extended to support this. See for example
A model for input and output of multilingual text in a windowing environment
by Yutaka Kataoka, Masato Morisaki, Hiroshi Kuribayashi, and Hiroyoshi Ohara
ACM Transactions on Information Systems (TOIS)
Volume 10 Issue 4, Oct. 1992

>
> I am still a little worried about two things:
> - debugging a remote system that has different locale
> - reading logs created by modules that have different encodings (some
> modules might be legacy and not be aware that the world is not Latin-1
> anymore).

Ouch. And then there are all those documents that lie about the
encoding they're using. (Web pages claiming Latin 1 but being CP 1252
does not exhaust the possibilities.)
