I am coding a web application in which the content is a Unicode string built up over multiple functions and maintained in a State structure.
I gather that the String type is inefficient and that Data.Text would be a better choice.
Is it more efficient to build up a list of Text objects over time and combine them together with a single Data.Text.concat for the final output or to run Data.Text.append for each new string so that I am maintaining a single Text object rather than a list?
As Data.Text.append requires copying both strings each time, my gut feeling is that concat would be much more efficient, but Haskell has surprised me before, so I wanted to check.
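For concreteness, here is a minimal sketch of the two approaches I am comparing. The function names are made up for illustration and are not part of Data.Text; both produce the same Text and differ only in how much copying happens along the way.

```haskell
import Data.List (foldl')
import qualified Data.Text as T

-- Hypothetical helper: keep the pieces in a list and join them once at the end.
renderWithConcat :: [T.Text] -> T.Text
renderWithConcat = T.concat

-- Hypothetical helper: maintain a single Text and append each new piece to it.
-- Every T.append builds a fresh Text by copying both of its arguments.
renderWithAppend :: [T.Text] -> T.Text
renderWithAppend = foldl' T.append T.empty

main :: IO ()
main = do
  let pieces = map T.pack ["<p>", "caf\233", "</p>"]
  -- The two strategies agree on the result.
  print (renderWithConcat pieces == renderWithAppend pieces)
```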
Kevin
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
I'd say, use Data.Text.Lazy and its 'fromChunks' function if you produce
the string chunkwise. That avoids copying.
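A small sketch of that suggestion, assuming the chunks are already available
as strict Text values (the helper name 'assemble' is hypothetical):

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Hypothetical helper: present the already-built strict chunks as one lazy Text.
-- fromChunks links the existing chunks together instead of copying their contents.
assemble :: [T.Text] -> TL.Text
assemble = TL.fromChunks

main :: IO ()
main = do
  let chunks = map T.pack ["Hello, ", "w\246rld", "!"]
      page   = assemble chunks
  -- Length in characters of the assembled text.
  print (TL.length page)
  -- Forcing everything into one strict Text gives the same result as T.concat.
  print (TL.toStrict page == T.concat chunks)
```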
Perhaps Data.ByteString[.Lazy].UTF8 is an even better choice than Data.Text
(depends on what you do).
Well, not necessarily.
String can be quite memory efficient. As a stupid example,
length (replicate 10000000 'a')
will need less memory than the equivalents using ByteString or Text.
Less stupidly, if the String is lazily produced and consumed from first to
last, String is memory efficient. And it's not necessarily much slower than
ByteString or Text.
In fact, String is sometimes faster than Text (cf. e.g.
http://www.haskell.org/pipermail/haskell-cafe/2010-May/078220.html and
following).
When you have to deal with text that is ASCII or latin1 (or some other
encoding with a byte <-> char correspondence), plain ByteStrings are
usually by far the fastest method. But that's of course a severe
restriction.
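To illustrate the restriction, a minimal sketch using Data.ByteString.Char8
(the helper 'roundTrips' is made up for this example). Characters up to
U+00FF survive the trip through a ByteString; anything above is truncated to
its low 8 bits.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Hypothetical helper: does a String survive packing into a Char8 ByteString?
-- BC.pack keeps only the low 8 bits of each Char, so this holds only for
-- characters in the range U+0000..U+00FF.
roundTrips :: String -> Bool
roundTrips s = BC.unpack (BC.pack s) == s

main :: IO ()
main = do
  print (roundTrips "GATTACA")   -- plain ASCII
  print (roundTrips "caf\233")   -- latin1: e with acute accent, U+00E9
  print (roundTrips "\955")      -- Greek lambda, U+03BB, is outside the range
```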
>
> So why is there a UTF8 implementation for bytestrings? Does that not
> duplicate what Text is trying to do? If so, why the duplication?
I think Data.ByteString.UTF8 predates Data.Text.
> When is each library more appropriate?
Generally, ByteString for binary data or text, when you know it's safe and
you need the speed.
For text, either String or Data.Text may be the better choice.
IIRC, Data.Text uses utf-16 (or some other 16-bit encoding), so if you
receive utf-8 encoded text, Data.ByteString.UTF8 can be the better choice.
I haven't much experience with either Data.Text or Data.ByteString.UTF8, so
I can't say much about their relative merits.
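For reference, a sketch of what moving received UTF-8 bytes into Data.Text
looks like, using Data.Text.Encoding rather than Data.ByteString.UTF8. The
helper name 'decodeReceived' is an assumption for illustration, and the input
is assumed to be valid UTF-8 (decodeUtf8 throws an exception otherwise).

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: decode bytes that are assumed to be valid UTF-8.
decodeReceived :: B.ByteString -> T.Text
decodeReceived = TE.decodeUtf8

main :: IO ()
main = do
  -- The UTF-8 encoding of "café": the last character takes two bytes.
  let bytes = B.pack [0x63, 0x61, 0x66, 0xC3, 0xA9]
      text  = decodeReceived bytes
  -- Byte count versus character count.
  print (B.length bytes, T.length text)
  -- Encoding back to UTF-8 recovers the original bytes.
  print (TE.encodeUtf8 text == bytes)
```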
>
> Kevin
http://hackage.haskell.org/packages/archive/rope/0.6.1/doc/html/Data-Rope.html
I'd also be happy to work with you if the current API falls short of your
needs.
-Edward Kmett
> String can be quite memory efficient. As a stupid example,
>
> length (replicate 10000000 'a')
>
> will need less memory than the equivalents using ByteString or Text.
>
Actually, this will be fused with Data.Text, and should execute more quickly
and in less space than String.
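A sketch of the two versions side by side, in case anyone wants to measure
them. The function names are made up, and whether fusion actually fires
depends on the text version and on compiling with optimisations, so treat
this as something to benchmark rather than a guarantee.

```haskell
import qualified Data.Text as T

-- The String version from the example above.
stringLength :: Int -> Int
stringLength n = length (replicate n 'a')

-- A Data.Text counterpart: n copies of the one-character Text "a".
textLength :: Int -> Int
textLength n = T.length (T.replicate n (T.singleton 'a'))

main :: IO ()
main = do
  let n = 10000000
  -- Both count the same number of characters.
  print (stringLength n == n)
  print (textLength n == n)
```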
>> So why is there a UTF8 implementation for bytestrings? Does that not
>> duplicate what Text is trying to do? If so, why the duplication?
> I think Data.ByteString.UTF8 predates Data.Text.
One difference is that Data.Text uses UTF-16 internally, not UTF-8.
>> When is each library more appropriate?
Much data is overwhelmingly ASCII, but with an option for non-ASCII in
comments, labels, or similar. E.g., for biological sequence data, files
can be large (the human genome is about 3GB) and non-ascii characters
can only occur in sequence headers which constitute a minuscule fraction
of the total data. So I use ByteString for this.
-k
--
If I haven't seen further, it is by standing in the footprints of giants
Right, forgot about fusion. However, that requires the code to be compiled
with optimisations, I think (well, one should never compile ByteString or
Text code without).
In which case
{-# RULES
"length/replicate" forall n x. length (replicate n x) = max 0 n
#-}
would be at least as good as the Data.Text thing ;)
Or, to be fairer, if you use Data.List.Stream, it should be fused too
and be just as efficient as Data.Text.