Unicode, accented characters


max3000

Mar 6, 2009, 5:31:22 PM
to Clojure
Hi,

I'm trying to output accented characters from clojure. Actually, I'm
trying to call setToolTipText on a JComponent with some unicode
string. No problems doing so from Java, but with clojure I'm hitting a
wall.

In REPL:
exmentis=> "àéôö"
"&→∟↔"
exmentis=> \u00f4
\├┤
exmentis=> \u00c0
\À

Ok, so the Reader doesn't read my input correctly, right? Let's make a
Java static member:
class Application {
public static final String abc = "àààèèè";
}
(Application is loaded in REPL by being on the classpath.)

exmentis=> Application/abc
"àààèèè"
exmentis=> (println Application/abc)
àààèèè
nil
exmentis=> (. System/out println Application/abc)
αααΦΦΦ
nil

(note the difference in output)

Hmmm... not better. BTW, doing System.out.println(Application.abc)
from Java outputs the correct result.


To put it simply, whenever I define a unicode String in Java, I obtain
the right result as long as I "stay" on the Java side. For instance

(. component setToolTipText Application/abc) ;; works
(. component setToolTipText "àéö") ;; doesn't work (garbled output)

What am I missing? Clojure uses Java strings, right?
exmentis=> (instance? String "àéè")
true

Why is the Reader not able to understand the string correctly then?

Max

P.S. I showed REPL output in this post, but what I actually do is load
clojure scripts from Java with RT.loadResourceScript(). It doesn't work
that way either.

(BTW, my REPL is:
java -cp %ADD_CP%;%CLOJURE_DIR%\jline-0.9.94.jar;%CLOJURE_JAR%
jline.ConsoleRunner clojure.lang.Repl
where ADD_CP is some additional classpaths.)
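
The garbled output in this post is consistent with bytes being written in one encoding and displayed in another. Below is a minimal Java sketch of that effect. It assumes the Windows console uses code page 437 (the thread never states the console code page), and the class and method names are made up for illustration. Under that assumption, the UTF-8 bytes of \u00f4 (C3 B4) show up as two box-drawing characters, and the windows-1252 bytes of "àààèèè" (E0 and E8) show up as the Greek letters alpha and Phi, which matches the REPL output above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text with one charset, then decode the very same bytes with another.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        Charset cp437 = Charset.forName("IBM437");        // assumed console code page
        Charset cp1252 = Charset.forName("windows-1252"); // WinXP default, per the thread

        // o-circumflex, written as a Unicode escape so the source encoding is irrelevant.
        String oCirc = "\u00f4";
        // Three a-grave followed by three e-grave.
        String abc = "\u00e0\u00e0\u00e0\u00e8\u00e8\u00e8";

        // What you see on screen still depends on the terminal's own encoding.
        System.out.println(reinterpret(oCirc, StandardCharsets.UTF_8, cp437));
        System.out.println(reinterpret(abc, cp1252, cp437));
    }
}
```

If the assumption about code page 437 is wrong for a given machine, the same helper can be pointed at another charset name to see what that console would display.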

Meikel Brandmeyer

Mar 6, 2009, 5:38:35 PM
to clo...@googlegroups.com
Hi,

On 06.03.2009 at 23:31, max3000 wrote:

> I'm trying to output accented characters from clojure. Actually, I'm
> trying to call setToolTipText on a JComponent with some unicode
> string. No problems doing so from Java, but with clojure I'm hitting a
> wall.
>
> In REPL:
> exmentis=> "àéôö"
> "&→∟↔"
> exmentis=> \u00f4
> \├┤
> exmentis=> \u00c0
> \À

Clojure
user=> "àéñî"
"àéñî"

Works For Me(tm).

> Ok, so the Reader doesn't read my input correctly, right?

No. I think it reads it correctly.

> jline.ConsoleRunner

I got a bit weird behaviour with jline.ConsoleRunner.
Try without it. That worked for me.

Sincerely
Meikel

max3000

Mar 6, 2009, 5:58:06 PM
to Clojure
I'm getting similar results without jline.ConsoleRunner. Also as I
mentioned I use RT.loadResourceScript and get the same results.

However, I'm using the clojure release from 2008-12-17 (the only
official version I see at http://code.google.com/p/clojure/downloads/list.)
Could that make a difference?

I don't really want to use the SVN version because I'm developing an
application and can really do without the (normal) instabilities that
come with development builds.

Thanks,

Max

Kevin Downey

Mar 6, 2009, 6:16:51 PM
to clo...@googlegroups.com
Jline is known to have issues with unicode.
--
And what is good, Phaedrus,
And what is not good—
Need we ask anyone to tell us these things?

RZez...@gmail.com

Mar 6, 2009, 8:03:27 PM
to Clojure
On Mar 6, 5:58 pm, max3000 <maxime.lar...@gmail.com> wrote:
>
> I don't really want to use the SVN version because I'm developing an
> application and can really do without the (normal) instabilities that
> come with development builds.
>

FYI, you may want to consider using SVN for now because there have
been breaking changes[1] since the last release. The general
consensus seems to be that breaking changes are allowed until version
1.0 is released. Most of the commits are bug fixes, so IMO it only
gets more stable, not less.

1: http://clojure.org/lazier

max3000

Mar 6, 2009, 9:34:31 PM
to Clojure
There is definitely a bug. In r994 (Aug 07, 2008) UTF8 encoding was
added to *in*, *out* and *err*. This messes up the Repl (and the
Reader in general) as discussed above.

Case in point, everything works fine when I go in the code and modify
RT.java as follows:

final static public Var OUT =
    Var.intern(CLOJURE_NS, Symbol.create("*out*"),
               new OutputStreamWriter(System.out)); //, UTF8));
final static public Var IN =
    Var.intern(CLOJURE_NS, Symbol.create("*in*"),
               new LineNumberingPushbackReader(
                   new InputStreamReader(System.in))); //, UTF8)));
(UTF8 commented out)

Could this be seen as a bug? Should I report it? Or was it done this
way for a reason?

Thanks,

Max
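
To illustrate what the charset argument to OutputStreamWriter changes, here is a small Java sketch that captures the bytes a writer sends to its underlying stream. The class and helper names are hypothetical, and windows-1252 stands in for the platform default mentioned later in the thread. This is not the Clojure source, only a demonstration of the mechanism.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {
    // Returns the bytes an OutputStreamWriter with the given charset
    // hands to the underlying stream for the given text.
    static byte[] bytesWritten(String text, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, cs)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String aGrave = "\u00e0"; // a-grave, escaped so the source encoding is irrelevant
        System.out.println(bytesWritten(aGrave, StandardCharsets.UTF_8).length);          // 2 (C3 A0)
        System.out.println(bytesWritten(aGrave, Charset.forName("windows-1252")).length); // 1 (E0)
    }
}
```

A console that expects one byte per accented character will therefore show two unrelated characters when the writer is forced to UTF-8.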

max3000

Mar 6, 2009, 10:21:12 PM
to Clojure
Some more information:

In REPL, everything seems fine:

exmentis=> "ààà"
"ààà"
exmentis=> (def a "àààà")
#'exmentis/a
exmentis=> a
"àààà"
exmentis=> (println a)
àààà
nil
exmentis=> (. System/out println a)
àààà
nil


However, when I import a Java class:

public class Application {
public static final String abc = "àààèèè";
...

user=> (import '(application Application))
nil
user=> Application/abc
"αααΦΦΦ"

So there is something more to the problem than what I said above...

I tried (blindly) modifying other places that used UTF8 to use the
default encoding. It didn't change anything.

My time is up for tonight! ;( Any help would be appreciated.

Thanks,

Max

max3000

Mar 7, 2009, 8:43:14 AM
to Clojure
Ok, so I ended up doing this in my code:

String resource = "/exmentis/rules_main.clj";
InputStream is = getClass().getResourceAsStream(resource);
String script = ... read in is as a String (like slurp) ...
StringReader r = new StringReader(script);
clojure.lang.Compiler.load(r, null, resource);

Note I use clojure.lang.Compiler directly because RT has no methods to
do what I want.

The above works fine, and requires no modifications to the clojure
source code.

I would still like an 'official' explanation, and to know whether my
way of doing things is 'blessed' by Mr. Hickey and the community.

Thanks,

Max
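
The "read in is as a String" step is left elided in the post. One possible shape for it, as a hedged sketch only: the helper name slurp and its explicit Charset parameter are assumptions, not what the poster actually wrote. Passing Charset.defaultCharset() would decode with the platform encoding, while StandardCharsets.UTF_8 would suit sources saved as UTF-8. The resulting String could then be wrapped in a StringReader as shown in the post.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Slurp {
    // Reads the whole stream as text, decoding with an explicitly chosen charset.
    static String slurp(InputStream in, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(in, cs)) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a script file saved as UTF-8 (accents written as escapes).
        String source = "(def a \"\u00e0\u00e9\")";
        byte[] onDisk = source.getBytes(StandardCharsets.UTF_8);
        String script = slurp(new ByteArrayInputStream(onDisk), StandardCharsets.UTF_8);
        System.out.println(script.equals(source)); // true
    }
}
```

Making the charset an argument keeps the decision visible at the call site instead of depending on whatever the JVM default happens to be.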

Toralf Wittner

Mar 7, 2009, 10:03:26 AM
to clo...@googlegroups.com
On Sat, 2009-03-07 at 05:43 -0800, max3000 wrote:
> Ok, so I ended up doing this in my code:
>
> String resource = "/exmentis/rules_main.clj";
> InputStream is = getClass().getResourceAsStream(resource);
> String script = ... read in is as a String (like slurp) ...
> StringReader r = new StringReader(script);
> clojure.lang.Compiler.load(r, null, resource);
>
> Note I use clojure.lang.Compiler directly because RT has no methods to
> do what I want.
>
> The above works fine, and requires no modifications to the clojure
> source code.

Hi Max,

Please tell us a bit about your environment (locale settings, OS). It
looks to me like your settings are different from UTF-8 and the reason
why the above procedure works is because Java will use the default
character set when decoding your source file. Within Java (or Clojure)
you can get the default character set with:

(java.nio.charset.Charset/defaultCharset)

which in my case produces #<UTF_8 UTF-8>. If you are using a different
character set (e.g. ISO-8859-1), some characters can not be mapped
directly between this and UTF-8. While I am not aware of any explicit
requirements regarding Clojure source file encodings, it seems that de
facto UTF-8 is assumed. Try encoding your sources as UTF-8 and things
should work as expected.

Cheers,
Toralf
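
The same check can be done from Java, together with a demonstration of what happens when text saved in a single-byte encoding is decoded as UTF-8. This is a minimal sketch: the class name is made up, and windows-1252 is only an example of a non-UTF-8 platform default. Bytes such as E0 are not valid UTF-8 on their own, so the decoder substitutes U+FFFD (the replacement character) for them, while plain ASCII survives.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {
    // Encodes text in the charset it was "really" saved in,
    // then decodes those bytes as if they were UTF-8.
    static String decodeAsUtf8(String text, Charset actual) {
        return new String(text.getBytes(actual), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Java equivalent of (java.nio.charset.Charset/defaultCharset).
        System.out.println("default charset: " + Charset.defaultCharset());

        // Four a-grave followed by "abcd", accents written as escapes.
        String text = "\u00e0\u00e0\u00e0\u00e0abcd";
        String garbled = decodeAsUtf8(text, Charset.forName("windows-1252"));

        System.out.println(garbled.contains("\uFFFD")); // true: accents became U+FFFD
        System.out.println(garbled.endsWith("abcd"));   // true: ASCII is unaffected
    }
}
```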


max3000

Mar 7, 2009, 1:21:13 PM
to Clojure
The default character set on WinXP (which I use) is windows-1252
(cp1252). Check out http://www.rgagnon.com/javadetails/java-0505.html.

If I were to change my source file encodings to UTF-8 that would
probably get me some mileage. Of course, I would have to use an editor
that supports it and not all editors would (on windows). However, it
wouldn't change anything in the REPL. Presumably, stdin in Java is
tied to the platform's default encoding and there is probably no way
to change that. My understanding is that clojure assumes reading a
file and reading stdin is the same thing encoding-wise. That's a
faulty assumption.

Typically, I believe clojure should read and write to/from the default
character set unless specifically told otherwise. UTF-8 is not the
default on all platforms.

Thanks,

Max

max3000

Mar 13, 2009, 2:46:24 AM
to Clojure
Any news on this item? Does what I'm saying make sense?

I understand most people who use clojure are probably English-speaking
and couldn't care less about internationalization, but this has to be
addressed if clojure is to get any semblance of semi-mainstream
adoption. In fact, one of the reasons I chose clojure myself is
because internationalization is a solved problem in Java (and hence I
thought in clojure as well). If the perception is that the problem is
"limited" to Windows, well, that's 90% of the deployed PCs out there.

Since the fix seems so trivial and requires changes in only about 5
lines of code, I'm not sure what prevents this from being fixed. At
least, is there a clojure bug tracking site where I could add this
issue?

Thanks,

Max


On Mar 7, 2:21 pm, max3000 <maxime.lar...@gmail.com> wrote:
> The default character set on WinXP (which I use) is windows-1252
> (cp1252). Check out http://www.rgagnon.com/javadetails/java-0505.html.

Michael Wood

Mar 13, 2009, 3:09:25 AM
to clo...@googlegroups.com
On Fri, Mar 13, 2009 at 8:46 AM, max3000 <maxime...@gmail.com> wrote:
>
> Any news on this item? Does what I'm saying make sense?
>
> I understand most people who use clojure are probably English-speaking
> and couldn't care less about internationalization, but this has to be
> addressed if clojure is to get any semblance of semi-mainstream
> adoption. In fact, one of the reasons I chose clojure myself is
> because internationalization is a solved problem in Java (and hence I
> thought in clojure as well). If the perception is that the problem is
> "limited" to Windows, well, that's 90% of the deployed PCs out there.
>
> Since the fix seems so trivial and requires changes in only about 5
> lines of code, I'm not sure what prevents this from being fixed. At
> least, is there a clojure bug tracking site where I could add this
> issue?

What happens if you do this:

;; untested
(binding [*in* (new LineNumberingPushbackReader
                    (new InputStreamReader System/in))
          *out* (new OutputStreamWriter System/out)]
  (your-code-here))

--
Michael Wood <esio...@gmail.com>

max3000

Mar 13, 2009, 3:26:41 AM
to Clojure
I paste below what I got. It's doing something for sure, but just for
*out* I think.

Thanks.


user=> (println "ààààabcd")
����abcd
nil
user=> (binding [*in* (new LineNumberingPushbackReader
                           (new InputStreamReader System/in))
                 *out* (new OutputStreamWriter System/out)]
         (println "ààààabcd"))
????abcd
nil
user=> (binding [*in* (new LineNumberingPushbackReader
                           (new InputStreamReader System/in))]
         (println "ààààabcd"))
����abcd
nil
user=> (binding [*out* (new OutputStreamWriter System/out)]
(println "ààààabcd"))

????abcd
nil



Christophe Grand

Mar 13, 2009, 5:59:12 AM
to clo...@googlegroups.com
here is some background info on the change:
http://groups.google.com/group/clojure/browse_thread/thread/123ef17d7c650018/e1da76a4a273aa5a

max3000 wrote:


> The default character set on WinXP (which I use) is windows-1252
> (cp1252). Check out http://www.rgagnon.com/javadetails/java-0505.html.
>
> If I were to change my source file encodings to UTF-8 that would
> probably get me some mileage. Of course, I would have to use an editor
> that supports it and not all editors would (on windows). However, it
> wouldn't change anything in the REPL. Presumably, stdin in Java is
> tied to the platform's default encoding and there is probably no way
> to change that. My understanding is that clojure assumes reading a
> file and reading stdin is the same thing encoding-wise. That's a
> faulty assumption.
>

I think that forcing source files to be UTF-8 is a good thing when
sharing code.

On the subject of the REPL, what about adding a property (e.g.
"clojure.repl.encoding") to override the default? (I'm still ambivalent
on this subject: I don't know which default (UTF-8 or platform) is best.)

Christophe
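
A rough Java sketch of how such an override could be resolved. The property name "clojure.repl.encoding" is only the suggestion made in this post and is not known to exist in Clojure, and the class and method names are hypothetical. The sketch falls back to the platform default when the property is unset, which is just one of the two defaults under discussion.

```java
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class ReplEncoding {
    // Hypothetical property name, taken from the suggestion in the thread.
    static final String PROPERTY = "clojure.repl.encoding";

    // Uses the charset named by the property if set, else the platform default.
    // Charset.forName throws an unchecked exception for an unknown name.
    static Charset replCharset() {
        String name = System.getProperty(PROPERTY);
        if (name == null || name.isEmpty()) {
            return Charset.defaultCharset();
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        Charset cs = replCharset();
        // Readers and writers for stdin/stdout would then share that charset.
        InputStreamReader in = new InputStreamReader(System.in, cs);
        OutputStreamWriter out = new OutputStreamWriter(System.out, cs);
        System.out.println("REPL streams would use: " + cs.name());
    }
}
```

It could be tried with something like java -Dclojure.repl.encoding=windows-1252 ReplEncoding; swapping the fallback to UTF-8 is a one-line change if that default is preferred.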


--
Professional: http://cgrand.net/ (fr)
On Clojure: http://clj-me.blogspot.com/ (en)

