Proposal: a library of character/Unicode utility functions

Dave Yarwood

Oct 12, 2014, 7:19:26 PM
to cloju...@googlegroups.com
Hi everyone,

I came to Clojure already being familiar with Haskell, so it was natural to compare the two. One thing that I noticed is that Clojure lacks a library of character utility functions. Haskell's Data.Char (http://hackage.haskell.org/package/base-4.7.0.1/docs/Data-Char.html) is a good example of the kind of thing I'm talking about. I also think it would be nice for Clojure to have a more intuitive way to represent ranges of characters, such as all lowercase letters, all (Latin-1) digits, etc.

As it is, it's easy enough to use Java inter-op to utilize the methods in Java's Character class, and using (map char (range (int \a) (inc (int \z)))) to represent all lowercase letters isn't TOO terribly obnoxious, but I think that having a more intuitive way of working with characters in Clojure would be an asset to the language's expressivity and friendliness to newcomers.

With that in mind, I've taken the liberty of putting together this rough draft of a possible character utility library for Clojure: https://github.com/daveyarwood/djy

I could see this being namespaced as clojure.char (on analogy with clojure.string, clojure.set, etc.), or perhaps as a contrib library if that is more appropriate. Ultimately, if Clojure/core is not interested, I could just make it available as a Clojar for whomever may find it useful.

Some key things that I've put into this library:
- API functions are polymorphic -- most of them can accept characters, integers representing Unicode code points, or strings (allowing you to work with supplementary characters that cannot be represented as Java/Clojure character literals).
- There are wrappers for Java methods (Character/isISOControl, Character/toUpperCase, etc.), as well as other useful utility functions.
- char' (on analogy with +', *', etc.) -- an extended version of the char function that will return a string containing a supplementary character if you provide the code point, e.g. (char' 135641) => 𡇙
- next and prev functions, which will return the character one code point before or after a given character/code point
- char-seq -- like seq (when used on a string), but each of the resulting seq's items can be either a character or a string containing a supplemental character. You could use this function to get an accurate character count in a body of text that contains supplementary characters.
- char-range -- returns the range of chars from %1 to %2. This would make it significantly easier to represent things like all lowercase letters -- (char-range \a \z)
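
For readers who want something concrete, here is a minimal sketch of how char-range and the next/prev functions listed above could be written with plain Clojure. It is an illustration only, not the actual djy code; next-char and prev-char are placeholder names, and this version assumes BMP characters only.

```clojure
;; Hypothetical sketch, not the actual djy implementation.
;; Handles BMP characters only; supplementary code points would need strings.

(defn char-range
  "Inclusive seq of characters from start to end."
  [start end]
  (map char (range (int start) (inc (int end)))))

(defn next-char
  "Character one code point after c (placeholder name)."
  [c]
  (char (inc (int c))))

(defn prev-char
  "Character one code point before c (placeholder name)."
  [c]
  (char (dec (int c))))

(println (apply str (char-range \a \z)))
(println (next-char \a) (prev-char \z))
```

The range is inclusive on both ends, which is the main convenience over writing (range (int \a) (inc (int \z))) by hand.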

This is still a work in progress -- I'd be happy to hear any constructive feedback / criticism you guys might have, as well as any suggestions for character-related functions that could be useful for a library like this. Let me know what you think.

Thanks!
Dave

Ambrose Bonnaire-Sergeant

Oct 12, 2014, 8:42:50 PM
to cloju...@googlegroups.com
A superficial suggestion: djy.char should probably have (:refer-clojure :exclude [symbol?]) in the ns form.

Thanks,
Ambrose

--
You received this message because you are subscribed to the Google Groups "Clojure Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure-dev...@googlegroups.com.
To post to this group, send email to cloju...@googlegroups.com.
Visit this group at http://groups.google.com/group/clojure-dev.
For more options, visit https://groups.google.com/d/optout.

Dave Yarwood

Oct 13, 2014, 11:38:36 AM
to cloju...@googlegroups.com, abonnair...@gmail.com
That's a good idea, thanks -- I will do that for both `symbol?` and `next`.

John D. Hume

Oct 13, 2014, 12:22:37 PM
to cloju...@googlegroups.com, abonnair...@gmail.com
I'm not sure if this is a controversial suggestion, but I'd avoid using 'next' for a fn that's not at all like clojure.core/next.

Dave Yarwood

Oct 13, 2014, 12:46:29 PM
to cloju...@googlegroups.com, abonnair...@gmail.com

I think that's a good point. I did notice in the style guidelines that we should avoid doing that specifically.

On further thought, I think inc and dec would be better names for the functions currently called prev and next. The only thing is, I'm using clojure.core/inc and dec all over the place, so I guess I would have to fully qualify those functions wherever I'm using them, which could get a little messy... Thoughts on that?

Colin Jones

Oct 13, 2014, 12:54:24 PM
to cloju...@googlegroups.com
I've wanted something like this in the past - it's been pretty rough
whenever I've had to work with non-BMP characters (in Java). I think a
library like this (whether in contrib or elsewhere) would be a
terrific addition to the ecosystem.

Cool stuff.

--
Colin Jones
@trptcolin

Alex Miller

Oct 13, 2014, 1:13:44 PM
to cloju...@googlegroups.com
In general, I think this is a good idea for a library (contrib or no).  Few specific comments below.


On Sun, Oct 12, 2014 at 6:19 PM, Dave Yarwood <dave.y...@gmail.com> wrote:
> I could see this being namespaced as clojure.char (on analogy with clojure.string, clojure.set, etc.), or perhaps as a contrib library if that is more appropriate. Ultimately, if Clojure/core is not interested, I could just make it available as a Clojar for whomever may find it useful.

Probably best as a contrib but I haven't checked with Rich yet.

> Some key things that I've put into this library:
> - API functions are polymorphic -- most of them can accept characters, integers representing Unicode code points, or strings (allowing you to work with supplementary characters that cannot be represented as Java/Clojure character literals).

+1

> - There are wrappers for Java methods (Character/isISOControl, Character/toUpperCase, etc.), as well as other useful utility functions.

Mere wrappers are meh. However, one reason such a thing might be useful is if they define a portable API implementable across hosts (JVM/JS/CLR).

> - char' (on analogy with +', *', etc.) -- an extended version of the char function that will return a string containing a supplementary character if you provide the code point, e.g. (char' 135641) => 𡇙

+1 (not sure of char' name but idea seems good)

> - next and prev functions, which will return the character one code point before or after a given character/code point

+1 but wouldn't name them next/prev. succ/pred would be better (or possibly some play on inc/dec).

> - char-seq -- like seq (when used on a string), but each of the resulting seq's items can be either a character or a string containing a supplemental character. You could use this function to get an accurate character count in a body of text that contains supplementary characters.

It seems like some (all?) of this already exists via String/Char but I haven't used supplemental characters enough to know where the gaps are.

> - char-range -- returns the range of chars from %1 to %2. This would make it significantly easier to represent things like all lowercase letters -- (char-range \a \z)

+1 - I've built variants of this in aid of test generators several times.
 

Dave Yarwood

Oct 13, 2014, 2:35:14 PM
to cloju...@googlegroups.com
Thanks for all the great feedback! I've written some comments below:

On Monday, October 13, 2014 1:13:44 PM UTC-4, Alex Miller wrote:
> In general, I think this is a good idea for a library (contrib or no). Few specific comments below.
>
> On Sun, Oct 12, 2014 at 6:19 PM, Dave Yarwood <dave.y...@gmail.com> wrote:
>> - There are wrappers for Java methods (Character/isISOControl, Character/toUpperCase, etc.), as well as other useful utility functions.
>
> Mere wrappers are meh. However, one reason such a thing might be useful is if they define a portable API implementable across hosts (JVM/JS/CLR).
 
I was kind of wondering about that, and considering taking out the ones that are just wrappers. There are some that I think might still be more useful than the Java methods they wrap, by virtue of making them into polymorphic functions. Most of the Java Character class methods take either a character or a code point (integer) as an argument, so if you want to use them on a supplementary character, you have to supply the code point. For example, (Character/isLetter 120121) => true, vs. being able to do (letter? "𝔹") => true. The letter? function is just #(Character/isLetter (code-point-of %)), where code-point-of converts the supplementary character at index 0 of the string to its Unicode code point.
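
A rough sketch of the polymorphic wrapper described above, for illustration. The cond-based code-point-of shown here is an assumption about one possible shape; the library itself dispatches differently (a multimethod at this point in the thread), and the error handling is invented for the example.

```clojure
;; Hypothetical sketch of a polymorphic letter? built on a code-point-of helper.

(defn code-point-of
  "Coerces a char, an integer code point, or a string (using the code
  point that starts at index 0) to an integer code point."
  [x]
  (cond
    (char? x)    (int x)
    (integer? x) (int x)
    (string? x)  (.codePointAt ^String x 0)
    :else        (throw (IllegalArgumentException.
                          (str "Not a character: " (pr-str x))))))

(defn letter?
  "True if x (char, code point, or string) denotes a letter."
  [x]
  ;; The (int ...) cast selects the Character/isLetter(int) overload,
  ;; which accepts supplementary code points.
  (Character/isLetter (int (code-point-of x))))

;; Build the supplementary character from its code point so the example
;; does not depend on the source file's encoding.
(def double-struck-b (String. (Character/toChars 120121)))

(println (letter? \a) (letter? 120121) (letter? double-struck-b))
```

The point of the wrapper is that callers no longer need to know whether they hold a char, a code point, or a string.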
 
On the other hand, we could probably lose the 3 case conversion functions at the bottom of the file, which are literally just wrappers for Character/toLowerCase, toUpperCase and toTitleCase. It's easy enough to type (Character/toUpperCase \ü), for example.
 
 
>> - char' (on analogy with +', *', etc.) -- an extended version of the char function that will return a string containing a supplementary character if you provide the code point, e.g. (char' 135641) => 𡇙
>
> +1 (not sure of char' name but idea seems good)
 
I went back and forth between char* and char' as possible names for this function... googling Clojure naming conventions, it seems like the foo* syntax is mostly used in cases where foo is a macro and foo* is the function that it uses internally, so it doesn't seem like char* should be the name of a public API function. char' seemed like a good fit because you can think of it as doing a similar thing as the arbitrary precision math operator functions like +', *', etc. Those functions "promote" their arguments to arbitrary precision arithmetic types as needed, so I'm thinking of char' as a function that returns a character and "promotes" it to a string if it's above the BMP range supported by Java character literals.
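
To make the "promotion" idea concrete, here is a minimal sketch of what char' could look like. It is an assumption about the behaviour described above, not the library's source: BMP code points come back as a Character, anything above comes back as a two-char String.

```clojure
;; Hypothetical sketch of char': "promote" to a String above the BMP.

(defn char'
  "Returns a Character for a BMP code point, or a String holding the
  surrogate pair for a supplementary code point."
  [cp]
  (if (Character/isSupplementaryCodePoint (int cp))
    (String. (Character/toChars (int cp)))
    (char cp)))

(println (class (char' 97)))       ; a java.lang.Character
(println (class (char' 135641)))   ; a java.lang.String
```

One consequence of this design is that the return type depends on the argument's value, so callers that need a uniform type would have to wrap the result in str.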
 
I'm not 100% tied to that name, though, and would be interested in any alternative naming ideas.
 
 
>> - next and prev functions, which will return the character one code point before or after a given character/code point
>
> +1 but wouldn't name them next/prev. succ/pred would be better (or possibly some play on inc/dec).
 
I'm starting to lean towards char/inc and char/dec as the names for these functions, on analogy with clojure.core's inc and dec for integers.
I guess the big question here is whether this library will be intended to be "used" or "required." Right now I'm taking inspiration from clojure.string, which has some functions that would conflict with clojure.core functions, and so the intended usage is to require it ":as str" or s or whatever so that the conflicting functions are all namespaced. If the intended usage for this library would be something like (require '[djy.char :as char]), then the functions could be called like (char/inc \a).
 
I do like succ/pred, as an alternative. Or maybe next-char/prev-char.
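
Here is a small sketch of the "required, not used" approach, assuming the functions really were named inc and dec. The namespace names example.char and example.user are made up for the illustration; in a real project they would live in separate files.

```clojure
;; Hypothetical library namespace that shadows clojure.core/inc and dec.
(ns example.char
  (:refer-clojure :exclude [inc dec]))

(defn inc
  "Character one code point after c."
  [c]
  ;; The core functions must now be fully qualified inside this namespace.
  (char (clojure.core/inc (int c))))

(defn dec
  "Character one code point before c."
  [c]
  (char (clojure.core/dec (int c))))

;; Hypothetical consumer namespace: the alias keeps the names unambiguous.
(ns example.user
  (:require [example.char :as char]))

(println (char/inc \a) (char/dec \z))
(println (inc 1))   ; clojure.core/inc is untouched here
```

The :exclude clause avoids the "replaces clojure.core/inc" warning, at the cost of qualifying every internal use of the core functions.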
 
 
 
>> - char-seq -- like seq (when used on a string), but each of the resulting seq's items can be either a character or a string containing a supplemental character. You could use this function to get an accurate character count in a body of text that contains supplementary characters.
>
> It seems like some (all?) of this already exists via String/Char but I haven't used supplemental characters enough to know where the gaps are.
 
It's only an issue if you're dealing with a string that contains one or more supplementary characters, i.e. the code point of the character is greater than 65535. Most of the text that we see (in the Western world) rarely contains supplementary characters, but you start to see a lot more of these characters in Asian languages. Supplementary characters technically are not characters, at least in Java: a Java char is 16 bits wide and thus can only hold code point values up to 65535, i.e. BMP characters. Supplementary characters are represented as a combination of two 16-bit char values (called surrogates) within specific reserved ranges. In Java, you can work with supplementary characters either by using the code point (which is just an integer), or via a string containing the two surrogate characters, which should be displayed properly as the supplementary character if your OS is set up to display Unicode characters and you have the right fonts installed.
 
So a Java string containing a single supplementary character is actually two chars long. This can result in some inaccuracy if you're working with supplementary characters, e.g.
(count "𡇙") => 2
(map int "this should be one code point: 𡇙") => (116 104 105 115 32 115 104 111 117 108 100 32 98 101 32 111 110 101 32 99 111 100 101 32 112 111 105 110 116 58 32 55364 56793)
 
Instead of the code point of that supplementary character, 135641, we get the two UTF-16 code units of its surrogate pair, 55364 and 56793. If, in the process of slicing that string, those two chars were to be separated, the supplementary character would be lost.
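
The gap can be seen with plain Java interop. The sketch below contrasts counting chars with counting code points; code-points is a hypothetical helper written for this example, not a function from the library.

```clojure
;; Sketch: chars (UTF-16 code units) versus code points.

(def sample
  ;; "a" followed by the supplementary character with code point 135641.
  (str "a" (String. (Character/toChars 135641))))

(defn code-points
  "Vector of the integer code points in s, with each surrogate pair
  combined into a single code point."
  [^String s]
  (loop [i 0, acc []]
    (if (< i (.length s))
      (let [cp (.codePointAt s i)]
        ;; Character/charCount is 2 for supplementary code points, 1 otherwise.
        (recur (+ i (Character/charCount cp)) (conj acc cp)))
      acc)))

(println (count sample))                                  ; chars
(println (.codePointCount ^String sample 0 (count sample))) ; code points
(println (map int sample))                                ; surrogates appear separately
(println (code-points sample))                            ; pair is combined
```

Stepping by Character/charCount is what keeps the index from landing in the middle of a surrogate pair.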

Mikera

Oct 13, 2014, 11:43:10 PM
to cloju...@googlegroups.com
Hi Dave,

The overall idea sounds great, however having dynamic type inspection will affect performance quite significantly in many cases - probably by an order of magnitude or two based on my similar experiences with numeric code. This could rule out the use of this library for many use cases - given that low level character handling functions are likely to be in performance-sensitive areas (parsing? text analysis? data conversion?). If we are not careful, people will just end up re-writing more efficient versions of the functions, which will defeat the purpose of having a standard library.

Therefore I think this requires some careful thought about how to make it both easy and fast. Of course, there is nothing wrong with making it work, then making it fast later. But if we are going to do that, and if this is going to be a widely used / official contrib library, we would want to plan Clojure language changes to be aligned with this goal.

Examples of language changes that might help:
- More analysis of argument types at call sites to use primitive functions / short-circuit multi-methods / eliminate type checks where possible 
- Suppression of the warnings that we currently get when replacing a core function (I'm thinking clojure.core/char here)
- Compiler macros?

Dave Yarwood

Oct 14, 2014, 12:39:04 PM
to cloju...@googlegroups.com
That is some great insight -- I was wondering if performance might be a potential issue, relying on a multimethod like that. I figured that performance would not be a major issue because in my own (admittedly simple, and small-scale) benchmarks, I was seeing differences of only a couple hundredths of a millisecond -- this was for mapping my code-point-of multimethod vs. clojure.core/int over a 1,000,000-character long string of randomly generated BMP characters. I haven't given it a better test, though, and I can see how my current implementation might not scale very well for working with huge bodies of Unicode text, which is a likely scenario for this kind of a library.
 
I'm a little new to this -- do you have any ideas for a better benchmark? It might help for us to have an idea of how big of a performance drain the dynamic type inspection would cause. Of course, ideally we would have a better/faster solution that doesn't rely on type inspection.
 
Here's an idea: what if I split the library into 2 separate namespaces -- one for working with BMP characters and one for working with supplementary characters? This would eliminate the need for dynamic type inspection, as the BMP characters would have to be character literals, and the supplementary characters would have to be represented as strings. I could even include optional, dynamic functions for scenarios where performance isn't as much of a concern -- functions that do the type checking and then delegate to the appropriate library's function. Do you think that would help?
 
Thanks,
Dave

Dave Yarwood

Oct 14, 2014, 12:45:35 PM
to cloju...@googlegroups.com
P.S. for this hypothetical BMP character library, I could rely on Java inter-op, as many of the Character class methods can accept either a character or an int. Doing away with the code-point-of multimethod would actually eliminate the need for many of these functions, since they would just become simple wrappers for the underlying Java methods. So, I think both of these libraries would be pretty small and concise, which would be a major plus.
 
I'm starting to like this idea... maybe I'll play around with this a little tonight.

Dave Yarwood

Oct 14, 2014, 1:02:47 PM
to cloju...@googlegroups.com
... of course, this approach wouldn't really help for those use cases where you're working with a large volume of text, some of which you know is / could be supplementary characters, and you want to be able to treat them as supplementary characters instead of surrogates. So the performance thing would still be an unsolved problem for those cases. However, realistically we don't see supplementary characters very often, so I think having a BMP-specific library that doesn't rely on dynamic type inspection could help with performance quite a bit.

Andy Fingerhut

Oct 14, 2014, 2:21:41 PM
to cloju...@googlegroups.com
Hugo Duncan's criterium library may be useful to you in benchmarking: https://github.com/hugoduncan/criterium

Andy

Mikera

Oct 14, 2014, 9:18:25 PM
to cloju...@googlegroups.com
Protocols are BTW much better than multimethods, though still far from optimal. My rule-of-thumb estimates are:
- Statically typed, primitive Java call: 1-2ns
- Protocol call (dispatch on first argument): 10-20ns
- Multimethod call (dispatch on function of all arguments): 100-200ns

For testing this sort of stuff, I would suggest building a range of small synthetic benchmarks that test a reasonable range of representative operations, and comparing these to the pure Java equivalents.

Mapping possibly isn't the best way to test the character operations themselves, because mapping over a lazy sequence itself probably has more overhead than the basic character operations. I'd suggest doing just the pure operation in a simple "dotimes" loop or something similar. The criterium library is great for benchmarking this kind of stuff.
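
As a rough illustration of that suggestion, the sketch below times a plain core call, a protocol call, and a multimethod call in a simple dotimes loop. The names CodePoint, code-point, code-point-mm and time-ns are invented for the example, and the timing is crude (no controlled JIT warm-up), so the numbers it prints should be treated as indicative at best; criterium's bench or quick-bench would give more trustworthy figures.

```clojure
;; Hypothetical micro-benchmark: protocol vs. multimethod dispatch.

(defprotocol CodePoint
  (code-point [x] "Coerce x to an integer code point."))

(extend-protocol CodePoint
  Character (code-point [c] (int c))
  Long      (code-point [n] (int n))
  String    (code-point [s] (.codePointAt s 0)))

(defmulti code-point-mm class)
(defmethod code-point-mm Character [c] (int c))
(defmethod code-point-mm Long      [n] (int n))
(defmethod code-point-mm String    [s] (.codePointAt ^String s 0))

(defn time-ns
  "Average nanoseconds per call of (f x) over n iterations."
  [f x n]
  (let [start (System/nanoTime)]
    (dotimes [_ n] (f x))
    (/ (double (- (System/nanoTime) start)) n)))

(doseq [[label f] [["clojure.core/int" int]
                   ["protocol"         code-point]
                   ["multimethod"      code-point-mm]]]
  (time-ns f \a 100000) ; throwaway run so the JIT sees the code first
  (println label (time-ns f \a 1000000) "ns/call"))
```

Running the operation directly in the loop, rather than through map, keeps lazy-seq overhead out of the measurement.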

Dave Yarwood

Oct 15, 2014, 5:49:06 PM
to cloju...@googlegroups.com
Well, that's definitely a start. I could at least rewrite my multimethod as a protocol to get a ~10x performance boost.
 
I had another thought -- what if I were to write that piece of the library in Java? It would just be a simple, overloaded method that takes either a number, a character, or a string (presumably starting with a supplementary character) and returns the code point, and I would then call that method from the Clojure library. Would that solve the performance problem? Would writing Java code for a Clojure library be frowned upon?

John Gabriele

Oct 17, 2014, 12:05:16 PM
to cloju...@googlegroups.com
On Monday, October 13, 2014 2:35:14 PM UTC-4, Dave Yarwood wrote:
> Thanks for all the great feedback! I've written some comments below:
>
> On Monday, October 13, 2014 1:13:44 PM UTC-4, Alex Miller wrote:
>> In general, I think this is a good idea for a library (contrib or no). Few specific comments below.
>>
>> On Sun, Oct 12, 2014 at 6:19 PM, Dave Yarwood <dave.y...@gmail.com> wrote:
>>> - next and prev functions, which will return the character one code point before or after a given character/code point
>>
>> +1 but wouldn't name them next/prev. succ/pred would be better (or possibly some play on inc/dec).
>
> I'm starting to lean towards char/inc and char/dec as the names for these functions, on analogy with clojure.core's inc and dec for integers.
> I guess the big question here is whether this library will be intended to be "used" or "required." Right now I'm taking inspiration from clojure.string, which has some functions that would conflict with clojure.core functions, and so the intended usage is to require it ":as str" or s or whatever so that the conflicting functions are all namespaced. If the intended usage for this library would be something like (require '[djy.char :as char]), then the functions could be called like (char/inc \a).
>
> I do like succ/pred, as an alternative. Or maybe next-char/prev-char.
 

next-char and prev-char are very good names for this, IMO.

(Incidentally, I always wondered why Clojure doesn't use incr (for "increment") and decr (for "decrement") instead of inc (which makes me think "include") and dec (which makes me think "decimal").)

-- John

John Gabriele

Oct 17, 2014, 12:22:22 PM
to cloju...@googlegroups.com
On Friday, October 17, 2014 12:05:16 PM UTC-4, John Gabriele wrote:
>> I do like succ/pred, as an alternative. Or maybe next-char/prev-char.
>
> next-char and prev-char are very good names for this, IMO.


Oh. Or char-before and char-after.

Dave Yarwood

Feb 8, 2015, 6:45:53 PM
to cloju...@googlegroups.com
Bumping to see if there is any interest in collaborating to make this library more performant? I've done a little work on it since I last posted here (3 months ago). Most of the functions still depend on "code-point-of," which is doing dynamic type introspection so that the functions can work on character literals, ints or strings, but now code-point-of is implemented as a protocol, which is significantly faster than the previous multimethod approach. Performance is still lacking, though -- I've added some benchmarks to the repo. 

https://github.com/daveyarwood/djy/wiki/Benchmarks -- notice that djy.char functions tend to execute in ms, where the equivalent Clojure/Java core functions execute in µs. The advantage of my library's functions is being able to work with input that may contain supplementary characters without necessarily knowing ahead of time whether each character is BMP or supplementary -- it's a usability thing. Performance is clearly a concern, though. 

I'd be very interested in any suggestions for making djy.char more performant -- I could see this potentially being a very useful contrib library, if the performance issues can be overcome. I'm also very open to collaboration/pull requests, if anyone feels like hacking on this with me!

Cheers,
Dave