Clojure doesn't support .clj source files encoded in UTF-8.

Alex Woods

Jul 9, 2015, 11:59:45 AM
to clo...@googlegroups.com
Clojure doesn't support .clj source files encoded in UTF-8.
It's OK when the .clj source files are encoded in ASCII.

env:
Windows 7, JDK 1.8u45, lein 2.5.0

Daniel Compton

Jul 9, 2015, 3:33:46 PM
to clo...@googlegroups.com
Hi Alex

You'll need to give us some more information about this to help us troubleshoot what's going on. Can you share the file with us?
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Daniel

Denis Fuenzalida

Jul 11, 2015, 9:57:59 PM
to clo...@googlegroups.com
I was able to reproduce an error involving Windows 7 and UTF-8 in a virtual machine with VirtualBox 4.3 (not sure if it is the issue that Alex experienced though):

* Installed Windows 7, then used Ninite.com to install Notepad++ (text editor) and Oracle JDK 8 (1.8.0_45). Installed Leiningen 2.5.1 as a .bat file from the website.
* Created a new Leiningen project with "lein new app utf8test"
* Opened the file src/utf8test/core.clj in Notepad++ and replaced its contents with the following:

(ns utf8test.core (:gen-class))
(defn saludo-año [año] (str "Saludos en el año " año))
(def saludo-japonés "どうもありがとう")
(defn -main [& args]
  (println (saludo-año 2015))
  (println saludo-japonés))

* In Notepad++, went to the Encoding menu, selected "Encoding in UTF-8 w/o BOM" and saved the file. Running "lein run" in the cmd.exe console works, but it prints garbage in place of every non-ASCII character (see http://i.imgur.com/H0rngyq.png)

* To trigger the compilation error, change the encoding of the file in Notepad++ to "Encoding in UTF-8" and save the file. This time "lein run" fails to compile and complains that it is unable to resolve a symbol (see http://i.imgur.com/3SHegTH.png). However, if you print the contents of the file in the cmd.exe console (with "type src\utf8test\core.clj"), you'll see some extra garbage characters before the namespace declaration.

My theory is that such garbage chars are the Byte Order Mark (BOM) Unicode character (https://en.wikipedia.org/wiki/Byte_order_mark) and they are not being correctly handled in Windows somewhere in the stack.
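
One way to check this theory is to look at the first three bytes of the file, since the UTF-8 encoding of the BOM (U+FEFF) is always EF BB BF. The following is only a minimal sketch; the helper names `starts-with-bom?` and `file-has-bom?` are made up for illustration and are not part of Clojure or Leiningen.

```clojure
(require '[clojure.java.io :as io])

;; The UTF-8 encoding of U+FEFF, as unsigned byte values.
(def utf8-bom [0xEF 0xBB 0xBF])

(defn starts-with-bom?
  "True if the first three bytes of `bs` (a byte array or a seq of bytes)
  are the UTF-8 BOM. Hypothetical helper, for illustration only."
  [bs]
  ;; JVM bytes are signed, so mask each one to 0..255 before comparing.
  (= utf8-bom (map #(bit-and % 0xFF) (take 3 bs))))

(defn file-has-bom?
  "Reads up to three bytes from the file at `path` and checks them
  against the UTF-8 BOM."
  [path]
  (with-open [in (io/input-stream path)]
    (let [buf (byte-array 3)
          n   (.read in buf)]
      (and (= n 3) (starts-with-bom? buf)))))
```

Calling `(file-has-bom? "src/utf8test/core.clj")` from a REPL started in the project directory should tell the two Notepad++ encodings apart, assuming the file was saved as described in the steps above.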

I don't use Windows regularly and I never had UTF-8 issues on Linux though.

--

Denis Fuenzalida

Sungjin Chun

Jul 12, 2015, 7:39:50 PM
to clo...@googlegroups.com
On Mac OS X (Yosemite) and Linux (Ubuntu), this code works well (I'm using en_US.UTF-8 as
the locale, and so the encoding, for my system).

I suspect that the OS (Windows) or its configuration is the source of the problem.

Luc Prefontaine

Jul 12, 2015, 7:54:26 PM
to clo...@googlegroups.com
Windows a problem ?
Naaaa, impossible :)))

Luc P.

Sent from my iPhone

Baishampayan Ghose

Jul 12, 2015, 9:07:26 PM
to Clojure Group
Hi,

IIRC Windows requires UTF-8 encoded files to have the BOM (Byte Order Mark).
Can you verify that your file has the BOM?

Regards,
BG

--
Baishampayan Ghose
b.ghose at gmail.com

Avi Avicenna

Jul 12, 2015, 10:26:10 PM
to clo...@googlegroups.com
I followed the steps described by Denis Fuenzalida on my Windows 7 machine and I can completely reproduce the results.

So, on Windows 7, the solution is to save .clj files as UTF-8 without a BOM.
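
For files that were already saved with a BOM, re-saving each one by hand in an editor is not the only option. Below is a rough sketch of stripping the three BOM bytes programmatically; `drop-bom` and `remove-bom!` are hypothetical names, and the code assumes a JDK with java.nio.file (Java 7 or later). It overwrites the file in place, so try it on a copy first.

```clojure
(require '[clojure.java.io :as io])
(import '(java.nio.file Files OpenOption)
        '(java.util Arrays))

(defn drop-bom
  "Returns `bs` without a leading UTF-8 BOM (EF BB BF), or `bs` itself
  if it does not start with one. Hypothetical helper."
  ^bytes [^bytes bs]
  ;; JVM bytes are signed, so mask each one to 0..255 before comparing.
  (if (= [0xEF 0xBB 0xBF] (map #(bit-and % 0xFF) (take 3 bs)))
    (Arrays/copyOfRange bs 3 (alength bs))
    bs))

(defn remove-bom!
  "Rewrites the file at `path` in place without a leading UTF-8 BOM."
  [path]
  (let [p (.toPath (io/file path))]
    ;; With no options, Files/write creates or truncates the target file.
    (Files/write p (drop-bom (Files/readAllBytes p)) (make-array OpenOption 0))))
```

For example, `(remove-bom! "src/utf8test/core.clj")` would rewrite the test file from the steps above; files without a BOM are written back unchanged.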

Sungjin Chun

Jul 13, 2015, 12:46:57 AM
to clo...@googlegroups.com
Of course not. My files do not have a BOM. So the problem lies in the
BOM thingy?

You received this message because you are subscribed to a topic in the Google Groups "Clojure" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/clojure/Rk5JGhq-IJY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to clojure+u...@googlegroups.com.

Luc Préfontaine

Jul 13, 2015, 9:52:13 AM
to clo...@googlegroups.com
BG is right on it. I hit this problem a decade ago (roughly :)).
UTF-8 files with no BOM are not handled properly on Windows.
It assumes that they are ASCII-encoded. That works partially (both character sets have the same
encoding for many characters) but eventually fails.

Make sure that the files have a BOM. You can do this on a per-file basis using an IDE
(Eclipse, ...), or with bash scripts if you have access to a u*x environment.
I did not find an equivalent native Windows tool, but there might be some that can do this in batch.

Luc P.
Luc Préfontaine<lprefo...@softaddicts.ca> sent by ibisMail!

David Powell

Jul 13, 2015, 10:23:52 AM
to clojure
> * In Notepad++, went to the Encoding menu, selected "Encoding in UTF-8 w/o BOM" and saved the file. Running "lein run" in the cmd.exe console works, but it prints garbage in place of every non-ASCII character (see http://i.imgur.com/H0rngyq.png)

This is as expected.
Garbage characters are output because *out* is bound to the platform default encoding.  The platform default encoding will never be UTF-8 on Windows - it is likely to be something like Windows-1252, which is incapable of encoding those characters.

> * To trigger the compilation error, change the encoding of the file in Notepad++ to "Encoding in UTF-8" and save the file. This time "lein run" fails to compile and complains that it is unable to resolve a symbol (see http://i.imgur.com/3SHegTH.png). However, if you print the contents of the file in the cmd.exe console (with "type src\utf8test\core.clj"), you'll see some extra garbage characters before the namespace declaration.

This is because the BOM is not a valid character in the Clojure syntax.
Perhaps it would be a reasonable enhancement for Clojure to treat the BOM as whitespace.
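
To illustrate the point about *out*, here is a small sketch that reports the JVM default charset and prints through a writer that always encodes as UTF-8. The names `default-charset-name`, `utf8-writer` and `println-utf8` are invented for this example. Note that cmd.exe must also be switched to a UTF-8 code page (for example with "chcp 65001") and use a font that has the glyphs, so this alone may not be enough to get readable output.

```clojure
(import '(java.io OutputStream OutputStreamWriter PrintWriter)
        '(java.nio.charset Charset StandardCharsets))

(defn default-charset-name
  "Name of the JVM default charset, which is what *out* normally uses."
  []
  (.name (Charset/defaultCharset)))

(defn utf8-writer
  "Wraps the OutputStream `out` in an auto-flushing writer that encodes
  characters as UTF-8 regardless of the platform default. Hypothetical helper."
  [^OutputStream out]
  (PrintWriter. (OutputStreamWriter. out StandardCharsets/UTF_8) true))

(defn println-utf8
  "Like println, but writes UTF-8 bytes to `out` instead of going
  through the default-encoded *out*."
  [^OutputStream out & xs]
  (binding [*out* (utf8-writer out)]
    (apply println xs)
    (flush)))
```

In the test project, `(println (default-charset-name))` shows which encoding the console output is going through, and `(println-utf8 System/out saludo-japonés)` sends the UTF-8 bytes directly.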
 

David Powell

Jul 13, 2015, 10:27:39 AM
to clojure

On Mon, Jul 13, 2015 at 2:52 PM, Luc Préfontaine <lprefo...@softaddicts.ca> wrote:
> BG is right on it. I hit this problem a decade ago (roughly :)).
> UTF-8 files with no BOM are not handled properly on Windows.
> It assumes that they are ASCII-encoded. That works partially (both character sets have the same
> encoding for many characters) but eventually fails.
>
> Make sure that the files have a BOM. You can do this on a per-file basis using an IDE
> (Eclipse, ...), or with bash scripts if you have access to a u*x environment.
> I did not find an equivalent native Windows tool, but there might be some that can do this in batch.
>
> Luc P.

Clojure source files are expected to be in UTF-8 and Clojure on Windows doesn't require a BOM.

In fact, Clojure files must not contain a BOM, because the Clojure parser doesn't consider it to be whitespace and it will cause the error "Unable to resolve symbol: ? in this context".

Some software, such as Windows Notepad, uses the presence of a BOM to detect UTF-8, but that can be overridden in the File | Open dialog. Other than that, the behaviour of the BOM in Clojure should be the same on Linux and Windows: this stuff is all handled by Java code in the JDK, not by the Windows platform.


Luc Préfontaine

Jul 13, 2015, 10:56:11 AM
to clo...@googlegroups.com
I cannot remember the details, but in 2010 I had a similar problem in a cross-platform project
using Clojure. And problems earlier in another cross-platform/cross-language project.

So it's the other way around: no BOM at all...

Can't believe we are in 2015 still struggling with character set issues.
Having to think about this when saving a file in Notepad... That's depressing.
No wonder I now stay away from Windows as much as possible.

I can't understand why we cannot get some transparent behavior from the Java runtime.
These are human readable text files. Not some unreadable binary format.
I googled a bit about this and numerous people face this problem when reading Windows-generated
files. They all ended up having to skip the BOM, if present, when reading the file.

So much for portability. Yuck.
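
The "skip the BOM if present" workaround mentioned above can be sketched in a few lines. This is not what Clojure's own loader does; it is only an illustration for code that reads Clojure forms itself, and the names `skip-bom` and `read-forms` are hypothetical. It relies on the fact that the JDK's UTF-8 decoder passes the BOM through as the character U+FEFF.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io PushbackReader))

(defn skip-bom
  "Consumes a leading U+FEFF from `rdr` if there is one, then returns `rdr`.
  Hypothetical helper."
  ^PushbackReader [^PushbackReader rdr]
  (let [c (.read rdr)]
    ;; A negative value means end of stream; anything else that is not
    ;; the BOM is pushed back so the next read sees it again.
    (when-not (or (neg? c) (= c 0xFEFF))
      (.unread rdr c))
    rdr))

(defn read-forms
  "Reads all top-level forms from `source` (anything io/reader accepts),
  decoding it as UTF-8 and ignoring a leading BOM. Uses clojure.core/read,
  so it should only be given trusted input."
  [source]
  (with-open [rdr (PushbackReader. (io/reader source :encoding "UTF-8"))]
    (skip-bom rdr)
    (let [eof (Object.)]
      ;; doall forces the reads while the reader is still open.
      (doall (take-while #(not (identical? eof %))
                         (repeatedly #(read rdr false eof)))))))
```

Something like `(read-forms (io/file "src/utf8test/core.clj"))` should then return the same forms whether or not the editor wrote a BOM.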

Sungjin Chun

Jul 13, 2015, 6:46:30 PM
to clo...@googlegroups.com
Even when the character set is the same, there are many encoding schemes for it, and for portability
you have to consider both input and output encoding. On Mac OS X or Linux this is controlled by the locale system;
on Windows you can either force the system encoding through the Control Panel or change the encoding before
writing to the console. Here in Korea we do this routinely when developing internationalized applications. Of course, you have
to use the correct charset for an i18n application :-)


Luc Prefontaine

Jul 13, 2015, 8:11:48 PM
to clo...@googlegroups.com
I agree that the number of encodings makes a foolproof, transparent solution impossible to implement.

I still think that some simpler text file handling should exist out of the box on the JVM to read UTF files.

UTF-8 is kind of natural within the JVM.

Exposing all this BOM machinery every time you need to read a text file is a pain.

Either implement BOM recognition on the fly or make it mandatory in UTF-8 files everywhere.

The BOM is required for UTF-16 and above as far as I know.

The time spent on stupid issues like this one must be significant given the number of people struggling with this...


Sent from my iPhone