The code
test() -> "a∞b".
Compiles to code which returns the list
of integers [97,226,136,158,98].
This is very inconvenient. I had expected it to return
[97, 8734, 98]. The length of the list should be 3 not 5
since it contains three unicode characters not five.
Is this a bug or a horrible misfeature?
So how can I make a string with the three characters 'a' 'infinity' 'b'
test() -> "a\x{221e}b" is ugly
test() -> <<"a∞b"/utf8>> seems to be a bug
it gives an error in the
shell but is ok in compiled code and
returns
<<97,195,162,194,136,194,158,98>> which is
very strange
test() -> [$a,8734,$b] is ugly
/Joe
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions
You saved your source file as UTF-8, so between the two double-quotes,
the source file contains exactly those bytes. But the Erlang compiler
assumes your source code is Latin-1, so it thinks that you wrote a
Latin-1 string of 5 characters (some of which are non-printing). There's
as yet no support for telling the compiler that the input is anything
else than Latin-1, so you can't save your source files as UTF-8. (One
thing you can do is put the UTF-8 strings in another file and read them
at runtime.)
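The mismatch described above can be reproduced outside Erlang. What follows is a minimal sketch in Python, used here only for illustration; the variable names are made up and nothing in it is Erlang compiler code.

```python
# The text as typed: three Unicode characters.
text = "a\u221eb"  # 'a', INFINITY (U+221E), 'b'

# What a UTF-8 editor writes between the double quotes: five bytes.
utf8_bytes = text.encode("utf-8")

# A reader that assumes Latin-1 maps every byte to one character.
misread = [ord(ch) for ch in utf8_bytes.decode("latin-1")]

# A reader that knows the bytes are UTF-8 recovers three code points.
decoded = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

print(list(utf8_bytes))  # [97, 226, 136, 158, 98]
print(misread)           # [97, 226, 136, 158, 98]
print(decoded)           # [97, 8734, 98]
```

The five-element list is what a Latin-1 reader sees; the three-element list is what the original poster expected.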
> test() -> <<"a∞b"/utf8>> seems to be a bug
Try <<"åäö"/utf8>>. It works, but like your first example, the source
string is limited to Latin-1. Strings entered in the shell may be
interpreted differently though, depending on your locale settings.
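The eight-byte binary reported earlier is consistent with this. The sketch below is Python, for illustration only, and assumes the literal is first read as Latin-1 and the resulting five characters are then encoded as UTF-8. Under that assumption each non-ASCII byte is encoded a second time.

```python
# Step 1: the UTF-8 bytes of "a∞b" are misread as five Latin-1 characters.
latin1_misread = "a\u221eb".encode("utf-8").decode("latin-1")

# Step 2: those five characters are encoded as UTF-8 (the /utf8 step).
double_encoded = latin1_misread.encode("utf-8")

print(list(double_encoded))  # [97, 195, 162, 194, 136, 194, 158, 98]
```

Each of the bytes 226, 136 and 158 becomes a two-byte sequence, which gives exactly the eight bytes shown in the original post.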
/Richard
> On 07/30/2012 02:35 PM, Joe Armstrong wrote:
>> What is a literal string in Erlang? Originally it was a list of
>> integers, each integer
>> being a single character code - this made strings very easy to work with
>>
>> The code
>>
>> test() -> "a∞b".
>>
>> Compiles to code which returns the list
>> of integers [97,226,136,158,98].
>>
>> This is very inconvenient. I had expected it to return
>> [97, 8734, 98]. The length of the list should be 3 not 5
>> since it contains three unicode characters not five.
>>
>> Is this a bug or a horrible misfeature?
>
> You saved your source file as UTF-8, so between the two double-quotes, the source file contains exactly those bytes. But the Erlang compiler assumes your source code is Latin-1
I'd expect the string manipulation functions of Erlang assume that as
well (that strings are lists of "bytes"), don't they? E.g. that `words`
splits on 0x20 (and maybe 0xA0), not on the {{Zs}} general category?
No! Don't save a source file as UTF8, at least without a way of marking
up such files as being special. The problem is that if you do the trick
above, you have to ensure that you convert _all_ string literals
explicitly this way (at least if they may contain characters outside
ASCII). But if you have a character such as ö, or é, in a string and you
forget to convert explicitly from UTF8 to single code points, then that
"é" will in fact be 2 bytes, while in another module saved in Latin-1,
the string "é" that looks the same in your editor will be a single byte,
and they won't compare equal. Having modules saved with different
encodings is a recipe for disaster (in particular when it comes to
future maintenance). Erlang currently only supports Latin-1 in source
files; until that is fixed, you should keep your UTF8-data in separate
files.
/Richard
Very strange I tried that earlier, this is what happens:
$ Eshell V5.9 (abort with ^G)
1> unicode:characters_to_list([97,226,136,158,98], utf8).
[97,226,136,158,98]
The manual says the first argument is a utf8 string
/Joe
Oh dear - you're right of course.
This means that the only portable and 100% correct way to get 'a'
'INFINITY' 'b' into a string literal
is to say
"a\x{221e}b"
"a∞b" in any form won't work if the compiler is not explicitly
told "this file is utf8"
Should the pre-processor make a rude noise and only accept latin1
printable characters?
/joe
I can write a parse transform for ux, that converts bytes into code
points. Something like this:
ustr("a∞b")
The same idea is used here:
https://github.com/freeakk/i18n#using-unicode-strings-in-source-code
But it converts strings into an ICU format: binary, utf-16. And it is a hack.
On Tue, Jul 31, 2012 at 11:36 AM, CGS <cgsmc...@gmail.com> wrote:
> There are many pros and cons for switching from Latin-1 to UTF-8 (or
> whatever else which will nullify pretty much the understanding of byte
> character). ...snip... I do
> not deny some specific projects would benefit from such a character
> encoding, but think of maintaining such a code in an international
> environment.
Also, think about having to debug a system from a remote console that
doesn't support the right encoding (that's probably far-fetched in
this day and age, but possible).
> "-encoding()" can make quite a mess in a file. Think of an open source
> project in which devs from different countries append their own code. You
> will see a lot of "-encoding()" directives in a single file.
My understanding was that there should be one and only one such
directive, at the beginning of the file. I'm not even sure if there
are any editors that can handle files with mixed encodings...
> My point here is that the string manipulation should be kept apart from the
> code itself and to have two modules for manipulating normal lists and
> IO-lists (e.g., by extending unicode module). But that would be my own
> preference.
Yes, but what do you do about string literals? They are in the code...
regards,
Vlad
It's not clear to me what you mean by a 'project',
but why should a module written by someone who wants
comments in Māori (note the macron? Latin-4 or Unicode needed)
use a module written by someone who wants comments in Swedish?
It's no worse (and no better!) than having a 'project' where
some of the files assume tabs are set every 8 characters and
some of them assume tabs are set every 4 characters. It's a
thing you need written down; it's a thing your tools need to
understand; and it's a situation that doesn't need to persist
with sources that are under your control.
> I don't think that would be the single problem, but also all the code
> that assumes that source code is latin-1. Also, tools that handle
> source code will need to be able to recognize both the old and new
> encodings, as they might need to have to work with an older version of
> a file, before the conversion.
The whole point of an -encoding directive is that it is something
that syntaxtools should handle; by the time your code gets an AST
or a token list, encodings are entirely a thing of the past.
Gambit Scheme allows different files in a program to use different
encodings. It's no big deal: _only_ the code that converts between
a stream of bytes and a stream of characters knows anything about
encodings; internally it's all Unicode.
I haven't done this yet for my Smalltalk compiler because there
are other more urgent issues (like working around C compilers that
are trying to be helpful but fail), but the design work is done and
it should leave the tokeniser running at about the same speed as
the old Latin-1-only tokeniser.
There *will* be a period when I want to keep my old Latin-1 files
(don't fix what isn't broken) but want to start using Unicode in
new work.
SWI Prolog actually lets you change the encoding within a file,
which sounds crazy but maybe Jan wanted the machinery to be there
in case someone wanted ISO 2022 support. (Because that's basically
what 2022 *is*: switching encoding aspects on the fly.)
Why should a Japanese programmer be forbidden to write in her own
script just because some of the source files that get loaded at
run time are encoded in Latin 1?
>
> Another question that needs to be answered is also what encoding will
> the source code use outside strings and quoted atoms and comments
"Encoding" is a whole-file property. If the comments are encoded in
ISO 8859-5 (ISO Cyrillic), so are the strings, and if the strings are
encoded in ISO 8859-5, so are the atoms, both quoted and unquoted.
Encoding logically concerns the interface between the tokeniser and
the external byte stream (in the Unisys ClearPath MCP systems
translation between encodings is done by the operating system before
the data become available to the program). Once the changeover has
been made, the tokeniser should think that *all* characters are
Unicode characters.
> : do
> we want atoms and variable names to be utf8 too? Because I've seen at
> least an example of code that uses extended latin-1 characters in
> those places.
That's not a problem. If a file is encoded in ISO Latin 1, then certain
Unicode characters are encoded a certain way, BUT once into the tokeniser,
nobody knows or cares what that was. If another file is encoded in UTF-8,
then certain Unicode characters are encoded in a different way, BUT once
into the tokeniser, nobody knows or cares what that was.
Encode "(a×2)÷4 = ½a" as 28,61,d7,32,29,f7,34,20,3d,20,bd,61 (Latin-1)
or as 28,61,c3,97,32,29,c3,b7,34,20,3d,20,c2,bd,61 (UTF-8),
and as long as the tokeniser knows what it's getting, it should make
*no* difference to what you get, namely the list
[40,97,215,50,41,247,52,32,61,32,189,97] of integers one per Unicode
code-point. That's how it works in SWI Prolog.
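This can be checked mechanically. The following is a small Python sketch (illustrative only, not SWI Prolog or Erlang code) that decodes both byte sequences with the encoding each was written in and compares the resulting code points.

```python
# "(a×2)÷4 = ½a" as bytes; × is 0xD7, ÷ is 0xF7 and ½ is 0xBD in Latin-1.
latin1_bytes = bytes.fromhex("28 61 d7 32 29 f7 34 20 3d 20 bd 61")
utf8_bytes = bytes.fromhex("28 61 c3 97 32 29 c3 b7 34 20 3d 20 c2 bd 61")

# Decode each byte stream with the encoding it was actually written in.
from_latin1 = [ord(ch) for ch in latin1_bytes.decode("latin-1")]
from_utf8 = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

print(from_latin1)  # [40, 97, 215, 50, 41, 247, 52, 32, 61, 32, 189, 97]
print(from_latin1 == from_utf8)  # True
```

Past the decoding step there is nothing left to tell the two files apart.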
> Also, what should string manipulation functions do by default, should
> they assume an encoding?
No. That would make life insanely complicated. (Well, let's face it,
Unicode is already barking mad; this would make it *rabid* barking mad.)
> I think the only way to remain sane would be
> to have a special string type, tagged with the encoding
No, that's a way to go completely crazy.
The simple way is to distinguish between an inside and an outside.
INSIDE, everything is just Unicode. OUTSIDE is where the wild
things are. Encodings are *ONLY* relevant when you switch
between text encoded as byte sequences and text represented as
Unicode code point sequences.
I mean, can you *imagine* the complexity if "0" =:= "0" fails
because the first is tagged as Latin-1 and the second is tagged
as UTF-8?
How Unicode code-point sequences are represented inside the
machine-level representation of an Erlang atom, Erlang source code
should have no reason whatever to care. They could be UTF8; they
could be UTF16; they could be SCSU; they could be BOCU; they could
be something else entirely.
Converting between strings and binaries is the one place where Erlang
source code should have any reason to care, and it does have a reason
to care. But you will perceive that it is the *binary* that needs to
be associated with an encoding, not the *string*.
>
> Would a syntactic construct like u"some string" that returns a tagged
> utf8 string help?
No. However, <<"some string"/utf8>> *would* make sense.
> There are many pros and cons for switching from Latin-1 to UTF-8 (or whatever else which will nullify pretty much the understanding of byte character). On one hand, lists:reverse/1 really messes up the characters in the list
Yes, and that's not all it messes up by any means.
- If you have a sequence of lines represented as a string with network
line terminators (CR+LF) then the reversal of that list is NOT a
sequence of lines with network line terminators (applies to ASCII)
- If you use Unicode language tags, then the reversal of a language
tag is a language tag for a different language and applies to the
wrong characters
- The reversal of a Unicode string including variant selectors (or
other character shaping codes like ZWNJ or ZWJ) is a Unicode
string including variant selectors &c applied to the wrong characters
- The reversal of a Unicode string including a directional command
and a POP DIRECTIONAL FORMATTING code is a string in which there
is a POP before anything has been pushed.
...
So simply forming code points into [base,diacritical...] packets,
reversing the packets, and then flattening *still* isn't nearly
enough to make sense of a reversed string. Indeed, I am not sure
that there *is* any way to make sense of the notion of reversing
a Unicode string.
So I do not take 'lists:reverse/1 will not reverse a Unicodepoint
string correctly' as a criticism of representing strings as lists
of Unicodepoints. NOTHING will. I don't think there is any such
thing as "correctly" reversing such a string.
There are other operations you can easily do with a list that
don't make sense for Unicode strings either. Take just one
example: splitting a string at an arbitrary position. That can
separate a directional override from its pop. And having a
distinct data type is no protection against that problem: Java
and Javascript both have opaque string datatypes, but both
allow slicing a well formed string into pieces that are not
well formed.
> (to follow the first example, the output of "a∞b" in Latin-1 is totally different from the output of lists:reverse("b∞a") in Latin-1 - the default now). On the other hand, having, for example, Polish characters like "Ą Ę Ć" or French "Ç Î" or German "Ö ß" or Turkish "Ş" and so on (things become more complicated if we add languages based on different alphabet/symbols) in the code would require your editor to have support for those languages or else you will see really strange characters there.
Well, yes. But now you are asking whether the editor supports Unicode.
There are now plenty of editors that do. Right now I am composing mail
in an unbelievably crude text editor (the Mail program on Mac OS X) and
it displays these characters just fine.
>
> "-encoding()" can make quite a mess in a file. Think of an open source project in which devs from different countries append their own code. You will see a lot of "-encoding()" directives in a single file.
Nobody is suggesting that there should be an -encoding directive anywhere
but the first line of a file (or possibly the second). In fact it is
precisely the existence of -encoding directives that would make it possible
to *avoid* the mess you are describing.
Here's what you do.
(1) Write a tiny little program. Here is a first draft.
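#!/usr/bin/awk -f
# Usage: epaste.awk file1.erl... >pasted.erl
# Purpose: paste files in various encodings giving one file in UTF-8.
BEGIN {
    print "-encoding(utf_8)."
    fflush()    # make sure the header comes out before any iconv output
    for (i = 1; i < ARGC; i++) {
        input = ARGV[i]
        # Skip files that are empty or cannot be read.
        if ((getline x < input) <= 0) { close(input); continue }
        if (x ~ /^[ \t]*-[ \t]*encoding\([ \t']*[a-zA-Z0-9_]*[ \t']*\)/) {
            # The first line is an -encoding directive: pull out the
            # name and drop the line, since the header above replaces it.
            sub(/^[ \t]*-[ \t]*encoding\([ \t']*/, "", x)
            sub(/[ \t']*\).*$/, "", x)
            x = toupper(x)
            gsub(/_/, "-", x)
            cmd = "iconv -f " x " -t UTF-8"
        } else {
            # No directive: assume Latin-1 and keep the first line.
            cmd = "iconv -f ISO-8859-1 -t UTF-8"
            print x | cmd
        }
        # Pipe (not redirect) the remaining lines through iconv.
        while ((getline x < input) > 0) print x | cmd
        close(cmd)
        close(input)
    }
}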
(2) Instead of pasting together several files by doing
cat foo.erl ugh.erl bar.erl >fub.erl
just do
epaste.awk foo.erl ugh.erl bar.erl >fub.erl
What makes this *possible* is the existence of the -encoding lines.
Without it you are FUBAR.
> I might be wrong, but, switching to default UTF-8, wouldn't that force the compiler to use 2-byte (at least) per character?
Yes, you are wrong. Unicode is a 21-bit character set.
There are currently (6.1) more than 100,000 defined
characters, so 2 bytes is definitely not enough.
But UTF-8 is an *external* format.
What the compiler uses is entirely up to itself.
What the run-time system uses is something different again.
Atom names, for example, could be stored in some compressed format.
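The numbers are easy to confirm. In the Python sketch below (illustrative only), the largest Unicode code point needs 21 bits, and UTF-8 spends between one and four bytes per character depending on the code point.

```python
# The highest code point Unicode allows.
max_code_point = 0x10FFFF
print(max_code_point.bit_length())  # 21

# UTF-8 is variable width: 1 to 4 bytes per code point.
samples = ["a", "\u00e9", "\u221e", "\U0001F600"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]
```

So neither "one byte per character" nor "two bytes per character" describes UTF-8, and neither says anything about what a compiler or run-time system stores internally.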
> If so, for example, what about the databases based on Erlang for projects using strict Latin-1?
What about them? Do not make the mistake of confusing a
particular set of characters with a way of encoding them.
> And having a
> distinct data type is no protection against that problem: Java
> and Javascript both have opaque string datatypes, but both
> allow slicing a well formed string into pieces that are not
> well formed.
To be fair, they've got a further compounding issue: their string types
are dedicated but not opaque. They are sequences of UTF-16 code units
(on account of originally being UCS2 sequences).
As a result, you not only have the usual Unicode issues, which may or
may not be (non-trivially) solvable with grapheme-aware Unicode handling[0];
these are further compounded by the ability to see and break apart
surrogate pairs (so you can e.g. split a string in the middle of a
surrogate pair).
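To make the surrogate-pair problem concrete, here is a Python sketch (illustrative only) that lists the UTF-16 code units a Java or JavaScript string would expose for a character outside the Basic Multilingual Plane, and shows what a split between them leaves behind.

```python
# 'a', U+1F600 (outside the BMP), 'b': three code points.
text = "a\U0001F600b"

# The same text as 16-bit UTF-16 code units.
raw = text.encode("utf-16-be")
units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
print([hex(u) for u in units])  # ['0x61', '0xd83d', '0xde00', '0x62']

# Splitting after the second unit cuts the surrogate pair in half.
left, right = units[:2], units[2:]
left_ends_with_high_surrogate = 0xD800 <= left[-1] <= 0xDBFF
right_starts_with_low_surrogate = 0xDC00 <= right[0] <= 0xDFFF
print(left_ends_with_high_surrogate, right_starts_with_low_surrogate)  # True True
```

Three code points occupy four code units, and a slice taken by code-unit index can leave a lone surrogate on each side.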
CPython 3.3 has implemented a fully opaque string type. It exposes Unicode
code points (if I remember correctly), but that may or may not be the
underlying binary data (the underlying representation can dynamically switch
between latin-1, UCS2 and UCS4).
[0] Which also needs to be locale-aware: for instance, conversion to
lower/upper case is not a 1:1 mapping in Unicode, as different cultures
may have different uppercases for the same lowercase and the other way
around. The usual example is Turkish, in which the lowercase of "I" is "ı"
and the uppercase of "i" is "İ".
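The footnote can be illustrated with Python's built-in case methods, which are locale-independent and so always apply the default (non-Turkish) mappings. This is a sketch for illustration only.

```python
# Case conversion can change the number of code points.
sharp_s_upper = "\u00df".upper()     # German sharp s
print(sharp_s_upper)                 # SS

# Default mappings ignore Turkish rules: 'i' uppercases to plain 'I',
# not to U+0130 (capital I with dot above).
i_upper = "i".upper()
print(i_upper == "I")                # True

# Lowercasing U+0130 yields 'i' plus a combining dot above (U+0307).
dotted_lower = "\u0130".lower()
print([hex(ord(ch)) for ch in dotted_lower])  # ['0x69', '0x307']
```

A Turkish-aware conversion would have to be told the locale explicitly; the string alone does not carry enough information.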
First, thanks for the detailed explanation. I see I am still confusing
some of the issues.
On Wed, Aug 1, 2012 at 3:56 AM, Richard O'Keefe <o...@cs.otago.ac.nz> wrote:
> On 31/07/2012, at 7:36 PM, Vlad Dumitrescu wrote:
> It's not clear to me what you mean by a 'project',
I mean a set of related code, some of it possibly third-party.
> but why should a module written by someone who wants
> comments in Māori (note the macron? Latin-4 or Unicode needed)
> use a module written by someone who wants comments in Swedish?
Maybe not in the long run, but there will be a (long) transition
period where legacy code will still be used by new code.
> The whole point of an -encoding directive is that it is something
> that syntaxtools should handle; by the time your code gets an AST
> or a token list, encodings are entirely a thing of the past.
Yes, but I am one of the guys that is going to write some of the tools
that will handle this conversion, so I do care about the details.
> SWI Prolog actually lets you change the encoding within a file,
> which sounds crazy but maybe Jan wanted the machinery to be there
> in case someone wanted ISO 2022 support. (Because that's basically
> what 2022 *is*: switching encoding aspects on the fly.)
Are there any editors that can load/save a file with mixed encodings like that?
<...snip...>
> Converting between strings and binaries is the one place where Erlang
> source code should have any reason to care, and it does have a reason
> to care. But you will perceive that it is the *binary* that needs to
> be associated with an encoding, not the *string*.
> of the system
Right. Good explanation!
I am still a little worried about two things:
- debugging a remote system that has different locale
- reading logs created by modules that have different encodings (some
modules might be legacy and not be aware that the world is not Latin-1
anymore).
regards,
Vlad
> On 2012-08-01, at 06:14 , Richard O'Keefe wrote:
>
>> And having a
>> distinct data type is no protection against that problem: Java
>> and Javascript both have opaque string datatypes, but both
>> allow slicing a well formed string into pieces that are not
>> well formed.
>
> To be fair, they've got the further compounding issue that strings types
> are dedicated but not opaque: they are sequences of UTF-16 code units
> (on account of originally being UCS2 sequences).
You are right. I should not have said "opaque". The implementation
is *encapsulated*, but the fact that it's a slice of an array of
16-bit units shows through.
As it happens, I *wasn't* referring to the possibility of splitting
a codepoint between two surrogates. If we restrict our attention to
the Basic Multilingual Plane, it is *still* possible to slice a
well formed BMP string into pieces that are not well formed. I have
in mind things like the way Apple used to have two plus signs, one
for left to right text and one for right to left text, but since
Unicode has only one, the way to encode א+ב was
[Aleph, left-to-right override, plus, pop directional formatting, Beth],
and a division that gives the left part either 2 or 3 codepoints is one
that gives you two strings that make no sense.
As it happens, I don't know any programming language that deals with
this. My basic point is that any data structure for text that
*doesn't* ensure that all the 'strings' you deal with are well formed
has already lost its virginity and might as well be frankly and openly
just a sequence of code points.
>
>> but why should a module written by someone who wants
>> comments in Māori (note the macron? Latin-4 or Unicode needed)
>> use a module written by someone who wants comments in Swedish?
>
> Maybe not in the long run, but there will be a (long) transition
> period where legacy code will still be used by new code.
Sorry, my typing mistake here.
What I *meant* to write was "why should a [Māori] module
*NOT* use a [Swedish] one"? You were saying, or so I thought,
that there should be one project = one encoding, and I was saying
I thought that was too restrictive in practice.
>
>> The whole point of an -encoding directive is that it is something
>> that syntaxtools should handle; by the time your code gets an AST
>> or a token list, encodings are entirely a thing of the past.
>
> Yes, but I am one of the guys that is going to write some of the tools
> that will handle this conversion, so I do care about the details.
And by the time it gets to you, there won't *be* any details to care about.
>
>> SWI Prolog actually lets you change the encoding within a file,
>> which sounds crazy but maybe Jan wanted the machinery to be there
>> in case someone wanted ISO 2022 support. (Because that's basically
>> what 2022 *is*: switching encoding aspects on the fly.)
>
> Are there any editors that can load/save a file with mixed encodings like that?
I have no idea. There are a number of editors that claim to support
ISO 2022, which does mid-stream code switching, so they could presumably
be extended to support this. See for example
A model for input and output of multilingual text in a windowing environment
by Yutaka Kataoka, Masato Morisaki, Hiroshi Kuribayashi, and Hiroyoshi Ohara
ACM Transactions on Information Systems (TOIS)
Volume 10 Issue 4, Oct. 1992
>
> I am still a little worried about two things:
> - debugging a remote system that has different locale
> - reading logs created by modules that have different encodings (some
> modules might be legacy and not be aware that the world is not Latin-1
> anymore).
Ouch. And then there are all those documents that lie about the
encoding they're using. (Web pages claiming Latin 1 but being CP 1252
does not exhaust the possibilities.)