character encoding issue in compiled .js


Jason Hickner

Jul 31, 2011, 1:12:39 AM
to Clojure
I'm seeing non-UTF-8 characters in my compiled .js even though my .cljs
source file is UTF-8. Here's a very short example demonstrating the
issue:

https://gist.github.com/1116419

Notice that the colon at the start of :foo is being munged during
compilation.

This causes errors when the .js is executed in the browser, although
Blackfoot pointed out on IRC that if you include a charset meta tag in
your HTML those errors will go away:

<head><meta charset="utf-8"/>...</head>

I'm on OS X 10.6, editing with Vim (:set fileencoding reports utf-8
for my source file).

- Jason




HaiColon

Aug 2, 2011, 7:05:13 AM
to Clojure
I've encountered this problem too, on Ubuntu 11.04 with Emacs 23.2.1
and the ClojureScript repo cloned with Git 1.7.4.1.

For me, using a meta tag doesn't resolve the problem, though. Without a
meta tag that sets the encoding to UTF-8, an i with two dots above it
is displayed. With the meta tag added, or when viewing the .js file in
Emacs, an empty rectangle is displayed, as shown in Jason's gist. This
happens both with the sample files that come with ClojureScript and
with new files.

--Chris

David Powell

Aug 2, 2011, 10:06:09 AM
to clo...@googlegroups.com

The character at the beginning of the string isn't a corrupt ':'; it is the Unicode noncharacter '\uFDD0', which seems to be output as an internal marker so that ClojureScript can distinguish keywords from strings.

The ClojureScript compiler outputs JavaScript as UTF-8.

So technically everything is fine, but if you are serving your JavaScript from a webserver, you need to ensure it is served in a way that tells the browser it is encoded as UTF-8. You could use .htaccess to make sure the Content-Type header is set correctly, eg:
  Content-Type: text/javascript; charset=utf-8
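
For example, on Apache with mod_mime enabled, a single .htaccess line along these lines should add that charset for .js files (just a sketch, adjust to your own server setup):
  AddCharset UTF-8 .js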

Or you could try including the encoding on the script tag (although theoretically that isn't supposed to override the headers):
  <script type="text/javascript" src="script.js" charset="utf-8"></script>


However, that is all a bit too fragile. I think ClojureScript needs fixing to be more robust:

When you compile with optimizations, the Closure compiler accepts UTF-8 input but outputs everything as US-ASCII, using Unicode backslash-u escapes to represent non-ASCII characters. This is much safer. But when you compile without optimizations, ClojureScript just outputs Unicode as UTF-8 directly.

I think ClojureScript should be modified to use US-ASCII encoding and backslash-u escaping, like the Closure compiler does.
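
Just to sketch what I mean (hypothetical illustration in Clojure, not the actual ClojureScript emitter; escape-non-ascii is a made-up name), something along these lines would turn every character outside printable ASCII into a \uXXXX escape while emitting strings:

  (defn escape-non-ascii
    "Escape every character outside printable ASCII as a \\uXXXX
    sequence so the emitted JavaScript is pure US-ASCII, no matter
    what charset the server declares."
    [^String s]
    (apply str
           (map (fn [c]
                  (let [i (int c)]
                    (if (<= 32 i 126)
                      (str c)
                      (format "\\u%04x" i))))
                s)))

  ;; e.g. the \uFDD0 keyword marker would come out as the six ASCII
  ;; characters \ufdd0 instead of raw UTF-8 bytes.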

-- 
Dave


