[Haskell-cafe] [ANN] unicode-transforms-0.2.0 pure Haskell unicode normalization

Harendra Kumar

Oct 25, 2016, 12:59:33 PM

to haskell-cafe, has...@haskell.org

Hi,

I released unicode-transforms some time back as bindings to a C library (utf8proc). Since then I have rewritten it completely in Haskell. The Haskell data structures are generated automatically from the Unicode Character Database, so the package can be kept up to date with the standard, unlike the C implementation, which was stuck at Unicode 5. The implementation comes with a test suite providing 100% code coverage.
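To give a rough idea of what "data structures generated from the database" can look like, here is a toy sketch. It is not the package's actual code: the names `decomposeChar`, `combiningClass`, `reorder` and `toyNFD` are hypothetical, the tables are written by hand and cover only a handful of code points, and Hangul and compatibility decompositions are left out. It shows canonical decomposition followed by canonical reordering of combining marks.

```haskell
-- Toy sketch only: hand-written stand-ins for tables that would be
-- generated from the Unicode Character Database. Not the package's code.
module Main where

import Data.List (sortOn)

-- Hypothetical "generated" function: full canonical decomposition of a
-- single code point. Only two entries are listed; a real table is far larger.
decomposeChar :: Char -> String
decomposeChar '\x00C5' = "\x0041\x030A"  -- A WITH RING ABOVE -> A + COMBINING RING ABOVE
decomposeChar '\x00E9' = "\x0065\x0301"  -- E WITH ACUTE -> e + COMBINING ACUTE ACCENT
decomposeChar c        = [c]             -- everything else maps to itself here

-- Hypothetical "generated" function: canonical combining class.
-- Class 0 marks a starter; non-zero classes are combining marks.
combiningClass :: Char -> Int
combiningClass '\x0301' = 230  -- COMBINING ACUTE ACCENT
combiningClass '\x0307' = 230  -- COMBINING DOT ABOVE
combiningClass '\x030A' = 230  -- COMBINING RING ABOVE
combiningClass '\x0323' = 220  -- COMBINING DOT BELOW
combiningClass _        = 0

-- Canonical ordering: stably sort each run of combining marks by class.
-- sortOn is stable, so marks with equal classes keep their relative order.
reorder :: String -> String
reorder [] = []
reorder s@(c:cs)
  | combiningClass c == 0 = c : reorder cs
  | otherwise =
      let (marks, rest) = span ((/= 0) . combiningClass) s
      in  sortOn combiningClass marks ++ reorder rest

-- Decompose every character, then put the combining marks in canonical order.
toyNFD :: String -> String
toyNFD = reorder . concatMap decomposeChar

main :: IO ()
main = do
  -- A precomposed character is split into base letter plus mark.
  print (toyNFD "\x00C5" == "A\x030A")
  -- Dot above (230) followed by dot below (220) gets swapped.
  print (toyNFD "q\x0307\x0323" == "q\x0323\x0307")
```

A fast implementation would not work on `String` like this; the sketch is only meant to show the shape of the lookup functions and the two normalization steps.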

After a number of algorithmic and implementation-level optimizations, I was able to get several times better decompose performance than the C implementation. I have not yet had a chance to fully optimize the compose operations, but they are still as fast as utf8proc.

I would like to thank Antonio Nikishaev for the unicode character database parsing code which I borrowed from the prose library.

https://github.com/harendra-kumar/unicode-transforms

https://hackage.haskell.org/package/unicode-transforms

-harendra

William Yager

Oct 25, 2016, 1:06:22 PM

to Harendra Kumar, haskell-cafe

Interesting! What would you say allowed you to get better decompose performance than the C library?

Will

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

Harendra Kumar

Oct 25, 2016, 1:34:22 PM

to William Yager, haskell-cafe

I did not compare the two implementations in detail; I just focused on getting as much performance out of the Haskell implementation as possible. I can point to two things that might have allowed it to be better:

1) I extracted as much implementation efficiency as possible from the Haskell code, so I did not lose there. The code could have been much simpler without all the optimizations.

2) My implementation may be better in terms of the algorithms and data structures used. Unicode normalization is complicated, and implementations can differ in many ways that make you lose or gain performance.

Beating the utf8proc implementation was easy. The best (highly optimized) normalization implementation is the ICU C++ implementation, and my target was to get close to that. I got pretty close to it (using the LLVM backend) in most benchmarks and even beat it clearly in one benchmark. I have filed a couple of enhancement requests against GHC; hopefully they will allow the Haskell implementation to be completely on par in all benchmarks, though the difference may not matter other than as proof that it can be as good.

-harendra
