Parsing XML containing UTF8 with clojure.xml/parse

Meikel Brandmeyer

Aug 12, 2009, 4:16:06 AM
to Clojure
Dear Clojurians,

I have to parse some XML files with c.x/parse. However the files
contain UTF-8 characters, which end up as '?' after being parsed by
c.x/parse. Is there some possibility to correctly parse the files? I
suspect there is some setting somewhere in my Clojure/JVM/System
which makes the whole thing fail, but I have no clue how to find out
where to look...

Sincerely
Meikel

Stephen C. Gilardi

Aug 12, 2009, 9:57:54 AM
to clo...@googlegroups.com

Hi Meikel,

I haven't used clojure.xml yet.

Does this help get you going:

http://groups.google.com/group/clojure/msg/0f6dc9ec66b852fe

More generally, you should also be able to specify the encoding by
arranging for an InputStreamReader with a properly specified
"charset" (like "UTF8") to wrap your input byte source.

--Steve
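
A minimal sketch of the InputStreamReader suggestion above. It assumes that clojure.xml/parse passes its source straight through to the SAX parser, so that an org.xml.sax.InputSource wrapping the reader is accepted. `parse-with-charset` is a hypothetical helper name and the sample document is made up for illustration.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream InputStream InputStreamReader)
        '(org.xml.sax InputSource))

;; Hypothetical helper: decode the byte stream with an explicit charset
;; and hand the resulting character stream to the parser.
(defn parse-with-charset
  [^InputStream in ^String charset]
  (xml/parse (InputSource. (InputStreamReader. in charset))))

;; Made-up sample; the \u escapes keep this source file pure ASCII.
(def sample-bytes (.getBytes "<foo bar=\"Gr\u00fc\u00dfe\"/>" "UTF-8"))

(def parsed
  (parse-with-charset (ByteArrayInputStream. sample-bytes) "UTF-8"))

(println (get-in parsed [:attrs :bar]))
```

When the parser is given a character stream like this, the decoding has already happened, so the charset named on the reader is what counts.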

Meikel Brandmeyer

Aug 12, 2009, 10:22:14 AM
to Clojure
Hi Stephen,

On Aug 12, 3:57 pm, "Stephen C. Gilardi" <squee...@mac.com> wrote:
> > I have to parse some XML files with c.x/parse. However the files
> > contain UTF-8 characters, which end up as '?' after being parsed by
> > c.x/parse. Is there some possibility to correctly parse the files? I
> > suspect there is some setting somewhere in my Clojure/JVM/System
> > which makes the whole thing fail, but I have no clue how to find out
> > where to look...

> Does this help get you going:
>
>        http://groups.google.com/group/clojure/msg/0f6dc9ec66b852fe

Thanks for the tip. Unfortunately, it doesn't help. Now everything is
completely chopped to pieces.

> More generally, you should also be able to specify the encoding by  
> arranging for an InputStreamReader with a properly specified  
> "charset" (like "UTF8") to wrap your input byte source.

I tried, but c.x/parse only accepts an InputStream. I didn't find
a way to set the charset on that one...

Sincerely
Meikel

B Smith-Mannschott

Aug 12, 2009, 10:30:10 AM
to clo...@googlegroups.com
Hi Meikel,

Please post code. Show us what you are trying to do, so we can help
instead of just guessing.

// Ben

B Smith-Mannschott

Aug 12, 2009, 10:31:47 AM
to clo...@googlegroups.com

You shouldn't have to. XML is funny that way:

InputStream is a stream of *bytes*, not characters. XML will try to
parse as UTF-8 if it doesn't find a <?xml ... ?> header specifying
some other encoding. So, in your case it should "just work" unless the
files you believe to be UTF-8 aren't actually UTF-8.
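
To illustrate, here is a small sketch that feeds raw bytes to clojure.xml/parse, once without an encoding declaration and once with one. `parse-bytes` is a hypothetical helper and both sample documents are made up.

```clojure
(require '[clojure.xml :as xml])
(import '(java.io ByteArrayInputStream))

;; Hypothetical helper: encode the text with the given charset and
;; parse the resulting bytes, leaving encoding detection to the parser.
(defn parse-bytes
  [^String text ^String charset]
  (xml/parse (ByteArrayInputStream. (.getBytes text charset))))

;; No declaration: the parser assumes UTF-8.
(def no-decl
  (parse-bytes "<foo bar=\"\u00e4\"/>" "UTF-8"))

;; Declared as ISO-8859-1 and actually encoded that way.
(def latin1
  (parse-bytes (str "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                    "<foo bar=\"\u00e4\"/>")
               "ISO-8859-1"))

(println (get-in no-decl [:attrs :bar]))
(println (get-in latin1 [:attrs :bar]))
```

In both cases the attribute should come back as the same single character, because the parser chose the decoder itself.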

// Ben

Meikel Brandmeyer

Aug 12, 2009, 11:27:58 AM
to Clojure
Hi,

On Aug 12, 4:30 pm, B Smith-Mannschott <bsmith.o...@gmail.com> wrote:

> Please post code. Show us what you are trying to do, so we can help
> instead of just guessing.

I have a file which looks roughly like this:
<?xml version="1.0" encoding="UTF-8"?>
<foo bar="<stuff I believe to be UTF-8 here>"/>

In fact things are more complicated, but the problem also happens
when I reduce the file to the above form. Unfortunately
I cannot share the failing file, since this is confidential
information of my company.

However, I get the feeling Clojure is not the problem.
I noticed, I forgot the "UTF-8" on the *output*. beh..
Now, Vim seems to be happy with the file.
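
A minimal sketch of naming the encoding on the output side, using plain Java interop. The temp file and the sample CSV line are made up for illustration, and this is not necessarily how the original output code was written.

```clojure
(import '(java.io File FileOutputStream OutputStreamWriter)
        '(java.nio.file Files))

;; Made-up output target for the demonstration.
(def out-file (File/createTempFile "utf8-demo" ".csv"))

;; Name the charset on the writer; without it, JVMs before Java 18
;; fall back to the platform default encoding.
(with-open [w (OutputStreamWriter. (FileOutputStream. out-file) "UTF-8")]
  (.write w "name;greeting\nMeikel;Gr\u00fc\u00dfe\n"))

;; U+00FC and U+00DF each take two bytes in UTF-8, so the file is
;; two bytes longer than the string has characters.
(def written-bytes (Files/readAllBytes (.toPath out-file)))

(println (alength written-bytes))
```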

This leads me to the conclusions that
a) Stephen's link above was solving the problem and
b) Excel is terribly annoying.

Off-topic: can I import UTF-8 CSVs in Excel?

Sorry for the noise and many thanks for your and
Stephen's efforts.

Sincerely
Meikel

Richard Newman

Aug 12, 2009, 3:51:17 PM
to clo...@googlegroups.com
> However, I get the feeling Clojure is not the problem.
> I noticed, I forgot the "UTF-8" on the *output*. beh..
> Now, Vim seems to be happy with the file.

I had a similar issue -- every Java component (e.g., Nailgun, which
VimClojure uses) needs to be started with the right character set.

http://www.holygoat.co.uk/blog/entry/2009-07-22-1
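
A quick way to see which character set a running JVM picked up, as a sketch. The `-Dfile.encoding=UTF-8` flag in the comment is one common way to set it at startup and is an assumption here; the linked post may describe a different mechanism.

```clojure
(import '(java.nio.charset Charset))

;; The charset the JVM falls back to when no encoding is named.
(def default-charset-name (.name (Charset/defaultCharset)))

;; The system property commonly set at startup, for example:
;;   java -Dfile.encoding=UTF-8 ...
(def file-encoding (System/getProperty "file.encoding"))

(println "default charset:" default-charset-name)
(println "file.encoding:  " file-encoding)
```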
