> user=> \u10000
> java.lang.IllegalArgumentException: Invalid unicode character: \u10000
>
> How would I embed the character as a literal in my Clojure code?
Java characters are (still) 16 bits wide, so a single Java character
cannot represent a code point above 0xFFFF such as U+10000.
Since Clojure characters are Java characters, you'll need to do this
the way the Java folks do.
I found a blog post about it here:
http://weblogs.java.net/blog/joconner/archive/2004/04/unicode_40_supp.html
This is also a good reference:
http://www.fileformat.info/info/unicode/char/10000/index.htm
This representation as a string from that page does seem to work in
Clojure:
"\ud800\udc00"
--Steve
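If writing the surrogate pair out by hand is inconvenient, the same string can be built from the code point itself. A minimal sketch using only java.lang.Character and String; the var name u10000 is made up for illustration:

```clojure
;; Build a String from a code point instead of typing the surrogates by hand.
;; Character/toChars returns a char[] holding one or two UTF-16 code units.
(def u10000 (String. (Character/toChars (int 0x10000))))

(= u10000 "\ud800\udc00")          ; => true, same two chars as the literal
(count u10000)                     ; => 2, counts 16-bit chars, not code points
(.codePointAt ^String u10000 0)    ; => 65536, the pair read back as one int
```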
seq on a String returns a sequence of Java characters (16-bit values).
(defn codepoints-seq [s] ; returns a lazy seq of code points (ints)
  (let [s (str s)
        n (count s)
        f (fn this [i]
            (lazy-seq
              (when (< i n)
                (cons (.codePointAt s i)
                      ;; advance by one code point (1 or 2 chars)
                      (this (.offsetByCodePoints s i 1))))))]
    (f 0)))
;; (codepoints-seq "\ud800\udc00a\ud800\udd00")
;; => (65536 97 65792)
--
Professional: http://cgrand.net/ (fr)
On Clojure: http://clj-me.blogspot.com/ (en)
> I see. Does this mean that, if I expect to handle 32-bit characters,
> then I need to consider changing my character-handling functions to
> accept sequences of vectors instead?
The blog post touches on this, and searching around on Google and
Wikipedia should turn up more info. Many APIs, especially in the
Character class, now have versions that accept and return "int"
values (in addition to versions that accept and return "char"
values) to support code points beyond 0xFFFF. An int is wide enough to
hold any individual Unicode code point.
It may be convenient for you to work with Strings rather than individual
characters when possible.
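As a rough illustration of those int-accepting overloads, here are a few java.lang.Character calls wrapped in a helper. The name describe-code-point and the map keys are made up for this sketch; the (int cp) coercion is there so the int overload is selected rather than the char one.

```clojure
;; Hypothetical helper: ask java.lang.Character about a code point given as an int.
(defn describe-code-point [cp]
  (let [cp (int cp)]
    {:supplementary? (Character/isSupplementaryCodePoint cp) ; above 0xFFFF?
     :char-count     (Character/charCount cp)                ; chars needed in UTF-16
     :letter?        (Character/isLetter cp)}))              ; int overload of isLetter

(describe-code-point 0x10000)
;; => {:supplementary? true, :char-count 2, :letter? true}
(describe-code-point 97)
;; => {:supplementary? false, :char-count 1, :letter? true}
```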
Java strings (as of 1.5) are UTF-16 encoded. This encoding allows
(legal) Unicode code points in the range 0 to 0xFFFF to be encoded as
a single Java character. To represent code points outside that range,
UTF-16 reserves a range of code points that are illegal as characters
on their own (0xD800–0xDFFF) and uses them to encode the code point as
a pair of 16-bit values (chars) called a surrogate pair. Using this
encoding you can represent all strings made up of legal Unicode code
points as Java strings.
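The arithmetic behind a surrogate pair can be sketched as follows. The function surrogate-pair is a hypothetical helper written for this explanation, not something in Clojure or the JDK, and it assumes its argument is already above 0xFFFF.

```clojure
;; Hypothetical helper: split a supplementary code point (above 0xFFFF)
;; into its UTF-16 high and low surrogate values.
(defn surrogate-pair [cp]
  (let [v (- cp 0x10000)]                 ; 20-bit offset past the 16-bit range
    [(+ 0xD800 (bit-shift-right v 10))    ; top 10 bits -> high surrogate
     (+ 0xDC00 (bit-and v 0x3FF))]))      ; low 10 bits -> low surrogate

(surrogate-pair 0x10000) ; => [55296 56320], i.e. 0xD800 0xDC00
(surrogate-pair 0x10100) ; => [55296 56576], i.e. 0xD800 0xDD00

;; Cross-check against what the JDK produces for the same code point.
(= (surrogate-pair 0x10100)
   (map int (Character/toChars (int 0x10100)))) ; => true
```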
> Also, how does (seq "\ud800\udc00") work? Does it split the character
> into two 16-bit characters? In the REPL, it seems to return (\? \?).
seq doesn't know about the UTF-16 encoding. It returns a sequence of
every char, even if it is part of a surrogate pair. It would be
possible to write a seq implementation for strings that knows about
UTF-16 and returns a sequence of Unicode code points represented as
ints instead. Whether doing that is useful or not depends on what
you're hoping to do with the characters.
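To see the split concretely, it helps to look at numeric values rather than printed characters. The \? output is probably just the terminal being unable to display a lone surrogate, though that is a guess about the REPL setup. A small sketch, with s as an arbitrary var name:

```clojure
(def s "\ud800\udc00")

(count s)                                      ; => 2, two 16-bit chars
(map int (seq s))                              ; => (55296 56320), the two surrogates
(.codePointCount ^String s 0 (count s))        ; => 1, a single code point
(Character/isHighSurrogate (char (first s)))   ; => true
```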
--Steve