How do you parse XML files with different encodings?

Daniel Jomphe

Apr 2, 2009, 10:04:55 PM
to Clojure
Let's say I have this file to parse:

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>Québécois français</root>

I spent many hours trying different ways of doing it, but still
haven't found one that works. Here are probably my best attempts:

(def n "ISO-8859-1")

(defmacro with-out-encoded [encoding & body]
  `(binding [*out* (java.io.OutputStreamWriter. System/out ~encoding)]
     ~@body))

(:content (clojure.xml/parse "french.xml"))
; ["Québécois français"]

(with-out-encoded n
  (:content (clojure.xml/parse "french.xml")))
; ["Québécois français"]

(with-out-encoded n
  (let [x (org.xml.sax.InputSource. (java.io.FileInputStream. "french.xml"))]
    (do
      (.setEncoding x n)
      (:content (clojure.xml/parse x)))))
; ["Québécois français"]

Some XML files come out differently when I run them through these
lines; this one does not. In no case, though, do I get the output I
expect.

How would you go about it?

Daniel Jomphe

Apr 3, 2009, 10:20:39 AM
to Clojure
Since I can't find the way to solve this issue, let's tackle it at a
more fundamental level.

First, I need to make sure I can print to standard output without
using *out*, so I can later, temporarily bind *out* to something else.
Also, I don't want to print directly to System/out. Later, I'll need
it to be wrapped into something that gives it a proper encoding.

user=> (import '(java.io PrintWriter PrintStream))
nil
user=> (.print (PrintWriter. (PrintStream. System/out) true) "bonjour")
nil

Do you know how to make this work?

Kyle Schaffrick

Apr 3, 2009, 10:30:26 AM
to clo...@googlegroups.com
On Fri, 3 Apr 2009 07:20:39 -0700 (PDT)
Daniel Jomphe <daniel...@gmail.com> wrote:
>
> Since I can't find the way to solve this issue, let's tackle it at a
> more fundamental level.
>
> First, I need to make sure I can print to standard output without
> using *out*, so I can later, temporarily bind *out* to something else.
> Also, I don't want to print directly to System/out. Later, I'll need
> it to be wrapped into something that gives it a proper encoding.
>
> user=> (import '(java.io PrintWriter PrintStream))
> nil
> user=> (.print (PrintWriter. (PrintStream. System/out) true)
> "bonjour")
> nil
>
> Do you know how to make this work?
>

I'm not sure if this is what you're asking, but...

Could you just make a new Var (like *always-stdout* or something) and
assign *out* to it at program start? This way the dynamic bindings on
*out* don't affect your new Var and you can continue using it to access
the "real" stdout.

I haven't tested this, just a suggestion :)
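
For illustration, a minimal sketch of that idea. The name always-stdout is hypothetical (no earmuffs, since it is never rebound), and the StringWriter only stands in for whatever *out* might be bound to later:

```clojure
;; Hypothetical Var: captures the writer *out* refers to when this file loads.
(def always-stdout *out*)

(defn print-while-rebound
  "Rebinds *out* to an in-memory writer, prints through both writers,
  and returns what the rebound *out* captured."
  []
  (let [captured (java.io.StringWriter.)]
    (binding [*out* captured]
      ;; print follows the dynamic binding, so this lands in `captured`...
      (print "captured")
      ;; ...while always-stdout still refers to the original writer.
      (.write ^java.io.Writer always-stdout "still on stdout\n")
      (.flush ^java.io.Writer always-stdout))
    (str captured)))
```

Calling (print-while-rebound) should write "still on stdout" to the real standard output and return "captured". To get (print ...) back, the saved writer can be rebound temporarily with (binding [*out* always-stdout] ...).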

-Kyle

Daniel Jomphe

Apr 3, 2009, 10:33:22 AM
to Clojure
Oh, somehow auto-flushing doesn't kick in here, which is why I wasn't
seeing anything other than nil:

user=> (def w (PrintWriter. (PrintStream. System/out) true))
#'user/w
user=> (.print w "bonjour")
nil
user=> (.flush w)
bonjournil

So although I asked for auto-flushing, I need to flush manually.
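
A likely explanation, going by the java.io.PrintWriter documentation: with autoFlush set to true, only println, printf, and format flush the buffer; print does not. A small sketch of that behaviour, where an in-memory ByteArrayOutputStream is an assumed stand-in for System/out so the effect can be measured:

```clojure
(import '(java.io ByteArrayOutputStream PrintWriter))

;; In-memory sink used instead of System/out (an assumption for this sketch).
(def ^ByteArrayOutputStream sink (ByteArrayOutputStream.))
(def ^PrintWriter w (PrintWriter. sink true)) ; autoFlush = true

;; print only fills the writer's internal buffer.
(.print w "bonjour")
(def bytes-after-print (.size sink))

;; println triggers the automatic flush, so the buffered text reaches the sink.
(.println w)
(def bytes-after-println (.size sink))
```

Here bytes-after-print should be 0 and bytes-after-println greater than 0, which matches what the REPL session above shows: nothing appears until something flushes the writer.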

Daniel Jomphe

Apr 3, 2009, 10:39:12 AM
to Clojure
Kyle Schaffrick wrote:
> Could you just make a new Var (like *always-stdout* or something) and
> assign *out* to it at program start? This way the dynamic bindings on
> *out* don't affect your new Var and you can continue using it to access
> the "real" stdout.

As long as I'm willing to lose the convenience of (print "string"), I
could. It might even be advisable instead of binding *out* in a with-
macro like I did; I'm not sure whether rebinding *out* that way is a
bad idea.

In any case, I think doing so wouldn't solve my encoding issues.

Daniel Jomphe

Apr 3, 2009, 10:57:02 AM
to Clojure
Now that I know for sure how to bind *out* to something else over
System/out, it's time to bring back my encoding issues into scope:

(import '(java.io PrintWriter PrintStream))

(defmacro with-out-encoded
  [encoding & body]
  `(binding [*out* (java.io.PrintWriter.
                     (java.io.PrintStream. System/out true ~encoding)
                     true)]
     ~@body
     (flush)))

(def nc "ISO-8859-1")

;;; with a normal string
(def s "québécois français")

(print s)
; quÔøΩbÔøΩcois franÔøΩaisnil

(with-out-encoded nc (print s))
; qu?b?cois fran?aisnil

;;; with a correctly-encoded string
(def snc (String. (.getBytes s nc) nc))

(print snc)
; qu?b?cois fran?aisnil

(with-out-encoded nc (print snc))
; qu?b?cois fran?aisnil

I'm certainly missing something fundamental somewhere.
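
One piece that may be missing here: a Java String holds Unicode characters, not bytes in some encoding, so a charset only matters when converting to or from bytes. A small sketch of what the round trip through getBytes does; the \u escapes are an assumption made so the example does not depend on how the source file is read:

```clojure
(def ^String s "qu\u00e9b\u00e9cois fran\u00e7ais") ; "québécois français"
(def ^String latin1 "ISO-8859-1")

;; Encoding and then decoding with the same charset gives back an equal
;; string, so snc is not "more Latin-1" than s.
(def snc (String. (.getBytes s latin1) latin1))
(def round-trip-equal? (= s snc))

;; The charset shows up only in the bytes: one byte per character in Latin-1,
;; two bytes for each of the three accented characters in UTF-8.
(def latin1-length (alength (.getBytes s latin1)))
(def utf8-length (alength (.getBytes s "UTF-8")))

;; A character the charset cannot represent is replaced by "?" when encoding.
(def unmappable (String. (.getBytes "\u20ac" latin1) latin1))
```

Under these assumptions round-trip-equal? is true, the byte counts are 18 and 21, and unmappable is "?". That last case is one way question marks can end up in output: the damage happens when characters are turned into bytes, not inside the string.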

Paul Stadig

Apr 3, 2009, 11:02:58 AM
to clo...@googlegroups.com
Works For Me (TM).

user=> (def s "québécois français")
#'user/s
user=> (print s)
québécois françaisnil

Are you running on Windows, Mac, or Linux? Using the Sun JVM? Which revision of Clojure?

paul@pstadig-laptop:~$ lsb_release -a
No LSB modules are available.
Distributor ID:    Ubuntu
Description:    Ubuntu 8.10
Release:    8.10
Codename:    intrepid

paul@pstadig-laptop:~$ java -version
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) Server VM (build 11.0-b15, mixed mode)

paul@pstadig-laptop:~/clojure$ svn info
Path: .
URL: http://clojure.googlecode.com/svn/trunk
Repository Root: http://clojure.googlecode.com/svn
Repository UUID: a41a16f3-9855-0410-b325-31a011a03e7c
Revision: 1336
Node Kind: directory
Schedule: normal
Last Changed Author: richhickey
Last Changed Rev: 1335
Last Changed Date: 2009-03-19 09:51:32 -0400 (Thu, 19 Mar 2009)


Paul

Daniel Jomphe

Apr 3, 2009, 11:07:01 AM
to Clojure
Sorry for all these posts.

I pasted my last post's code into a fresh repl (not in my IDE), and
here's what I got (cleaned up):

#'user/s
québécois françaisnil
qu?b?cois fran?aisnil

#'user/snc
québécois françaisnil
qu?b?cois fran?aisnil

I'm not sure what to make of it.

My terminal (Apple Terminal) supports the encoding and prints both s
and snc correctly out of the box.
When I use with-out-encoded, I actually break the printing of both s
and snc.

Daniel Jomphe

Apr 3, 2009, 11:15:50 AM
to Clojure
I'm running on Mac.

In IDE (IntelliJ):
Java: Apple's 1.6
Clojure: LaClojure's version (I don't know)

In Terminal:
Java: Apple's 1.5.0_16
Clojure: 1338, 2009-04-01

Daniel Jomphe

Apr 3, 2009, 11:37:32 AM
to Clojure
I tried under Eclipse.

Default console encoding configuration (MacRoman):

#'user/s
quÔøΩbÔøΩcois franÔøΩaisnil
qu?b?cois fran?aisnil

#'user/snc
qu?b?cois fran?aisnil
qu?b?cois fran?aisnil

Console configured to print using ISO-8859-1:

#'user/s
qu�b�cois fran�aisnil
qu?b?cois fran?aisnil

#'user/snc
qu?b?cois fran?aisnil
qu?b?cois fran?aisnil

Console configured to print using UTF-8:

#'user/s
québécois françaisnil
québécois françaisnil

#'user/snc
québécois françaisnil
québécois françaisnil

So, as I've come to understand it, UTF-8 looks like the Rolls-Royce
for my needs.

May I correctly conclude the following?

Don't bother about encodings unless you're displaying something and
it's unreadable; then, don't bother about it in the code; find a
proper console or viewer.

Doesn't that sound like offloading a problem to users? Isn't there
something reliable that can be done in the code?
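
One thing that can be done in code is to choose the charset explicitly at the point where characters become bytes, instead of relying on the platform default. Whether the console then decodes those bytes the same way remains outside the program's control. A minimal sketch, where encoded-bytes is a hypothetical helper and a ByteArrayOutputStream stands in for System/out so the bytes can be inspected:

```clojure
(import '(java.io ByteArrayOutputStream OutputStreamWriter))

(defn encoded-bytes
  "Prints s with *out* bound to a writer that encodes with charset-name,
  and returns the resulting bytes as unsigned integers."
  [^String charset-name s]
  (let [sink (ByteArrayOutputStream.)]
    ;; with-open closes the writer, which also flushes it into the sink.
    (with-open [w (OutputStreamWriter. sink charset-name)]
      (binding [*out* w]
        (print s)))
    (mapv #(bit-and % 0xff) (.toByteArray sink))))
```

With this sketch, (encoded-bytes "ISO-8859-1" "\u00e9") yields [233] (0xE9) and (encoded-bytes "UTF-8" "\u00e9") yields [195 169] (0xC3 0xA9): same character, different bytes, and the console has to expect the same charset the writer used.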

Daniel Jomphe

Apr 3, 2009, 4:08:19 PM
to Clojure
I made some progress.

[By the way, NetBeans' console displays *everything* 100% fine.
I decided to use one of the worst REPL consoles, IntelliJ's, because
I want to make sure I really understand what's going on behind all
this.]

(import '(java.io PrintWriter PrintStream FileInputStream)
        '(java.nio CharBuffer ByteBuffer)
        '(java.nio.charset Charset CharsetDecoder CharsetEncoder)
        '(org.xml.sax InputSource))

(def utf8 "UTF-8")
(def d-utf8 (.newDecoder (Charset/forName utf8)))
(def e-utf8 (.newEncoder (Charset/forName utf8)))

(def latin1 "ISO-8859-1")
(def d-latin1 (.newDecoder (Charset/forName latin1)))
(def e-latin1 (.newEncoder (Charset/forName latin1)))

(defmacro with-out-encod
  [encoding & body]
  `(binding [*out* (PrintWriter. (PrintStream. System/out true ~encoding) true)]
     ~@body
     (flush)))

(def s "québécois français")

(print s) ;quÔøΩbÔøΩcois franÔøΩaisnil
(with-out-encod latin1 (print s)) ;qu?b?cois fran?aisnil
(with-out-encod utf8 (print s)) ;qu?b?cois fran?aisnil

(def encoded (.encode e-utf8
                      (CharBuffer/wrap "québécois français")))
(def s-d
  (.toString (.decode d-utf8 encoded)))

(print s-d) ;quÔøΩbÔøΩcois franÔøΩaisnil
(with-out-encod latin1 (print s-d)) ;qu?b?cois fran?aisnil
(with-out-encod utf8 (print s-d)) ;qu?b?cois fran?aisnil

(def f-d
  (:content (let [x (InputSource. (FileInputStream. "french.xml"))]
              (.setEncoding x latin1)
              (clojure.xml/parse x))))

(print f-d) ;quÔøΩbÔøΩcois franÔøΩaisnil
(with-out-encod latin1 (print f-d)) ;québécois français
(with-out-encod utf8 (print f-d)) ;québécois français

So my theory, which is still almost certainly wrong, is:

1. When the input is a file whose encoding is, say, latin-1, it's easy
to decode it and then encode it however one wants.
2. When the input is a literal string in the source file, it looks
like it's impossible to encode it correctly, unless one first decodes
it from the source file's encoding. But then, I don't yet know how to
do this without actually reading the source file. :\
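
Both points can be checked in a small sketch. On point 1, when clojure.xml/parse is handed raw bytes, the underlying SAX parser reads the encoding from the XML declaration and decodes accordingly. On point 2, \uXXXX escapes in a string literal keep the source ASCII-only, so the literal no longer depends on the source file's encoding. The bytes are built in memory here (an assumption of this sketch) instead of being read from french.xml:

```clojure
(require 'clojure.xml)
(import '(java.io ByteArrayInputStream))

;; The document from the first post, written with \u escapes.
(def ^String xml-text
  (str "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
       "<root>Qu\u00e9b\u00e9cois fran\u00e7ais</root>"))

;; The bytes as they would sit on disk in a Latin-1 encoded file.
(def xml-bytes (.getBytes xml-text "ISO-8859-1"))

;; Given a byte stream, the parser decodes using the declared encoding.
(def parsed (clojure.xml/parse (ByteArrayInputStream. xml-bytes)))

(def content (:content parsed))
```

If this holds, content equals ["Qu\u00e9b\u00e9cois fran\u00e7ais"], meaning the parsed string is already correct in memory and any remaining garbling comes from how it is written out and displayed.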

Daniel Jomphe

Apr 6, 2009, 3:07:38 PM
to Clojure
I finally worked it all out.

For future reference, here's a record of my research on this:
http://stackoverflow.com/questions/715958/how-do-you-handle-different-string-encodings