Parsing Unicode character with Clojure

良ϖ

Aug 9, 2015, 11:49:21 AM

to clo...@googlegroups.com

I've run into some trouble parsing a Unicode character with
Clojure. It's likely a problem related to Java rather than Clojure
itself, but I'm looking for a Clojure-ish solution, which is why
I'm posting it here. FYI, I'm on GNU/Linux, on top of which
I use Emacs 24 in conjunction with CIDER 0.10.0snapshot (package:
20150710.1304), Java 1.8.0_51, Clojure 1.6.0 and nREPL 0.2.6.

The first character of the Unicode block "CJK Unified Ideographs
Extension B" is 𠀀 (I hope you can read it properly; install a Chinese
font otherwise). Emacs handles it perfectly, but gedit renders it as if
the character had the glyph you see (something like ㄛ but more
angular) plus a negative space. Although Emacs displays it properly,
the behaviour is weird when it is evaluated:

``` Clojure
華文.core> (clojure.string/split "a𠀀a" #"\𠀀")
; => ["a" "a"]
華文.core> (clojure.string/split "a𠀀a" #"\u20000")
; => ["a𠀀a"]
華文.core> (clojure.string/split "a𠀀a" #"[\u20000-\u2a6df]") ; it spans over Extension B
; => ["" "𠀀"]
```

Moreover:

``` Clojure
華文.core> \u20000
; => IllegalArgumentException Invalid unicode character: \u20000
clojure.lang.LispReader.readUnicodeChar
華文.core> (int \𠀀)
; => RuntimeException Unsupported character: \𠀀
clojure.lang.Util.runtimeException (Util.java:221)
華文.core> (format "%04x" (int \u3403))
; => "3403"
華文.core> (format "%04x" (int \u20000))
; => IllegalArgumentException Invalid unicode character: \u20000
clojure.lang.LispReader.readUnicodeChar
```

Finally, here is a very annoying side effect that looks like an
overflow: from 20000 on, the range seems to wrap around to values from
0, so the whole legacy ASCII range would be contained in this block.

``` Clojure
華文.core> (clojure.string/split "cabac" #"[\u20000-\u2a6df]")
; => []
華文.core> (clojure.string/split "cabac" #"[a-b]")
; => []
```

So I don't really know how to handle this character. I've picked a
few other characters at random, and it seems to be the same mess
above \u9999 :/

Andy Fingerhut

Aug 9, 2015, 12:22:41 PM

to clo...@googlegroups.com

Java uses UTF-16 encoding in memory for String objects. Characters in the Basic Multilingual Plane (BMP) are represented as a single 16-bit char in memory, but anything outside the BMP is represented as a sequence of two 16-bit chars (a surrogate pair). Clojure's \u<hex number> syntax can only be used to directly represent a single 16-bit char.

To represent characters outside the BMP, you can either use two \u<hex number> sequences, doing the UTF-16 encoding yourself by hand, or you can use a Java function like (Character/toChars 0x20000) to get a Java array of characters for Unicode code point 0x20000, or (String. (Character/toChars 0x20000)) to get a string.
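As a minimal sketch of the two approaches above: the name `ext-b-first` is only illustrative, and the hand-written surrogate pair is assumed to be written inside a string literal, since the reader rejects a lone surrogate as a character literal.

```clojure
;; U+20000 lies outside the BMP, so it needs two UTF-16 code units.
;; `ext-b-first` is a hypothetical name used only for this example.
(def ext-b-first (String. (Character/toChars 0x20000)))

;; Length in 16-bit chars versus length in Unicode code points.
(count ext-b-first)                                   ; => 2
(.codePointCount ext-b-first 0 (count ext-b-first))   ; => 1

;; Read the code point back; this replaces the failing (int \𠀀).
(format "%x" (.codePointAt ext-b-first 0))            ; => "20000"

;; The same string with the UTF-16 encoding done by hand:
;; high surrogate 0xD840, low surrogate 0xDC00.
(= ext-b-first "\ud840\udc00")                        ; => true
```

Working with `.codePointAt` and `.codePointCount` instead of `int` and `count` keeps the code correct for characters on either side of the BMP boundary.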

Andy


Andy Fingerhut

Aug 9, 2015, 12:24:56 PM

to clo...@googlegroups.com

Oh, and whether or not Java regular expressions let you specify ranges of such characters outside the BMP, I have no idea. I would expect there to be odd behavior in that area of Java's regular expression implementation, but haven't done extensive testing myself to find out. I would recommend that you do not rely on any behavior you have not tested extensively yourself there.
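For anyone who wants to test this, here is a hedged sketch, assuming Java 7 or later, where `java.util.regex.Pattern` accepts `\x{h...h}` for an arbitrary code point. In a regex, `\u` takes exactly four hex digits, so `\u20000` is read as U+2000 followed by a literal `0`. That would explain the odd results: `[\u20000-\u2a6df]` becomes U+2000, the range from `0` to U+2A6D (which covers ASCII letters), and `f`. The names `s` and `str` below are only illustrative.

```clojure
(require '[clojure.string :as str])

;; Hypothetical test string: "a", U+20000, "a".
(def s (str "a" (String. (Character/toChars 0x20000)) "a"))

;; \u consumes only four hex digits: this is U+2000 followed by "0".
(count "\u20000")                               ; => 2

;; \x{...} names a full code point, supplementary ones included.
(str/split s #"\x{20000}")                      ; => ["a" "a"]

;; A range spanning CJK Unified Ideographs Extension B.
(str/split s #"[\x{20000}-\x{2A6DF}]")          ; => ["a" "a"]

;; ASCII is no longer swallowed by the range.
(str/split "cabac" #"[\x{20000}-\x{2A6DF}]")    ; => ["cabac"]
```

This only covers the cases shown; other regex features should still be tested against supplementary characters before being relied on.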

Andy
