I did not fully compare the implementations; I just focused on getting as much performance out of the Haskell implementation as possible. I can say two things that might have allowed it to do better:
1) I squeezed as much efficiency as possible out of the Haskell code, so I did not lose anything there. The code could have been much simpler without all the optimizations.
2) My implementation may be better in terms of the algorithms and data structures used. Unicode normalization is complicated, and implementations can differ in many ways that gain or lose performance.
Beating the utf8proc implementation was easy. The best (highly optimized) normalization implementation is the ICU C++ implementation, and my target was to get close to that. I got pretty close to it (using the LLVM backend) in most benchmarks and even beat it clearly in one benchmark. There are a couple of enhancement requests that I filed against GHC; hopefully they will allow it to be fully on par in all benchmarks, though the remaining difference may not matter beyond proving that Haskell can be as good.
-harendra