32-bit Unicode character literals

samppi

Apr 26, 2009, 7:47:06 PM
to Clojure
In the REPL:

Clojure
user=> \u0032
\2
user=> \u10000
java.lang.IllegalArgumentException: Invalid unicode character: \u10000

How would I embed the character as a literal in my Clojure code?

Stephen C. Gilardi

Apr 26, 2009, 9:22:37 PM
to clo...@googlegroups.com
On Apr 26, 2009, at 7:47 PM, samppi wrote:

> user=> \u10000
> java.lang.IllegalArgumentException: Invalid unicode character: \u10000
>
> How would I embed the character as a literal in my Clojure code?

Java characters are (still) 16 bits wide. A single Java character
cannot represent the Unicode character you're looking to represent.
Since Clojure characters are Java characters, you'll need to do this
the way the Java folks do.

I found a blog post about it here:

http://weblogs.java.net/blog/joconner/archive/2004/04/unicode_40_supp.html

This is also a good reference:

http://www.fileformat.info/info/unicode/char/10000/index.htm

This representation as a string from that page does seem to work in
Clojure:

"\ud800\udc00"
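
If you would rather start from the code point than write the surrogates by hand, something like the following should give the same string. This is only a sketch: it assumes the standard `Character/toChars` method (Java 5 and later), and the name `supplementary` is made up for the example.

```clojure
;; Build the two-char string for U+10000 from its code point.
;; `supplementary` is a hypothetical name used only in this sketch.
(def supplementary (String. (Character/toChars (int 0x10000))))

(= supplementary "\ud800\udc00") ; same two 16-bit chars as the literal
(count supplementary)            ; counts chars, not code points
```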

--Steve

samppi

Apr 27, 2009, 10:07:59 AM
to Clojure
I see. Does this mean that, if I expect to handle 32-bit characters,
then I need to consider changing my character-handling functions to
accept sequences of vectors instead?

Also, how does (seq "\ud800\udc00") work? Does it split the character
into two 16-bit characters? In the REPL, it seems to return (\? \?).

Christophe Grand

Apr 27, 2009, 10:36:26 AM
to clo...@googlegroups.com
samppi wrote:

> I see. Does this mean that, if I expect to handle 32-bit characters,
> then I need to consider changing my character-handling functions to
> accept sequences of vectors instead?
>
> Also, how does (seq "\ud800\udc00") work? Does it split the character
> into two 16-bit characters? In the REPL, it seems to return (\? \?).
>

seq on a String returns a sequence of Java characters (16-bit values).

(defn codepoints-seq [s] ; returns a lazy seq of ints (code points)
  (let [s (str s)
        n (count s)
        ;; i is a char index; step forward by one whole code point
        f (fn this [i]
            (lazy-seq
              (when (< i n)
                (cons (.codePointAt s i)
                      (this (.offsetByCodePoints s i 1))))))]
    (f 0)))

;; => (codepoints-seq "\ud800\udc00a\ud800\udd00")
;; (65536 97 65792)
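
Going the other way, from a seq of code points back to a String, could be sketched as follows. The name `codepoints->str` is hypothetical; the only Java call assumed is `StringBuilder.appendCodePoint`.

```clojure
;; Hypothetical helper: turn a seq of int code points back into a String.
(defn codepoints->str [cps]
  (let [sb (StringBuilder.)]
    (doseq [cp cps]
      ;; appendCodePoint writes one or two chars as the code point requires
      (.appendCodePoint sb (int cp)))
    (str sb)))

(codepoints->str [65536 97 65792]) ; rebuilds "\ud800\udc00a\ud800\udd00"
```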

--
Professional: http://cgrand.net/ (fr)
On Clojure: http://clj-me.blogspot.com/ (en)


Stephen C. Gilardi

Apr 27, 2009, 10:50:19 AM
to clo...@googlegroups.com

On Apr 27, 2009, at 10:07 AM, samppi wrote:

> I see. Does this mean that, if I expect to handle 32-bit characters,
> then I need to consider changing my character-handling functions to
> accept sequences of vectors instead?

The blog post touches on this, and searching Google and Wikipedia
should turn up more information. Many APIs, especially in the
Character class, now have versions that accept and return "int"
values (in addition to the versions that accept and return "char"
values) to support code points beyond 0xFFFF. An int is wide enough
to hold any individual Unicode code point.
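
As a rough illustration of those int-based methods, here is a small sketch. The function name `code-point-info` is invented for the example; the three `Character` methods it calls are standard static methods that take an int code point.

```clojure
;; Hypothetical helper: query a few int-based Character methods.
(defn code-point-info [cp]
  (let [cp (int cp)]
    {:valid?         (Character/isValidCodePoint cp)         ; legal code point?
     :supplementary? (Character/isSupplementaryCodePoint cp) ; above 0xFFFF?
     :char-count     (Character/charCount cp)}))             ; 16-bit chars needed

(code-point-info 0x10000) ; valid, supplementary, needs 2 chars
(code-point-info 97)      ; valid, not supplementary, needs 1 char
```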

It may be convenient for you to work with Strings rather than
individual characters when possible.

Java strings (as of 1.5) are UTF-16 encoded. This encoding allows
(legal) Unicode code points in the range 0 to 0xFFFF to be encoded as
a single Java character. To represent code points outside that range,
UTF-16 uses a range of values that are illegal as a single Unicode
character (0xD800–0xDFFF) to encode the code point as a pair of
16-bit values (chars) called a surrogate pair. Using this encoding you
can represent all strings made up of legal Unicode code points as Java
strings.
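
The arithmetic behind a surrogate pair can be sketched like this. The function name `surrogate-pair` is made up for the example, and it assumes its argument is a code point above 0xFFFF; in real code `Character/toChars` does the same job.

```clojure
;; Hypothetical helper: split a code point above 0xFFFF into [high low].
(defn surrogate-pair [cp]
  (let [offset (- cp 0x10000)]                  ; 20 bits remain after subtracting
    [(+ 0xD800 (bit-shift-right offset 10))     ; top 10 bits -> high surrogate
     (+ 0xDC00 (bit-and offset 0x3FF))]))       ; low 10 bits -> low surrogate

(surrogate-pair 0x10000) ; [0xD800 0xDC00], i.e. "\ud800\udc00"
(surrogate-pair 0x10100) ; [0xD800 0xDD00], i.e. "\ud800\udd00"
```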

> Also, how does (seq "\ud800\udc00") work? Does it split the character
> into two 16-bit characters? In the REPL, it seems to return (\? \?).

seq doesn't know about the UTF-16 encoding. It returns a sequence of
every char, even one that is part of a surrogate pair. It would be
possible to write a seq implementation for strings that knows about
UTF-16 and returns a sequence of Unicode code points represented as
ints instead. Whether doing that is useful or not depends on what
you're hoping to do with the characters.
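
To make the difference concrete, here is a small sketch that compares what seq and count see with what `String.codePointCount` reports. The function name `char-vs-code-point-counts` is invented for the example.

```clojure
;; Hypothetical helper: compare the char-level and code-point-level views.
(defn char-vs-code-point-counts [^String s]
  {:chars       (map int (seq s))                 ; every 16-bit char, as an int
   :char-count  (count s)                         ; number of chars
   :code-points (.codePointCount s 0 (count s))}) ; number of code points

(char-vs-code-point-counts "\ud800\udc00")
;; two chars (55296 and 56320), but only one code point
```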

--Steve
