[Haskell-cafe] Efficient string construction

Kevin Jardine

unread,

Jun 3, 2010, 10:30:08 AM6/3/10

to haskel...@haskell.org

(I've done a basic Google search on this with no results. Apologies if this has been asked before.)

I am coding a web application in which the content is a Unicode string built up over multiple functions and maintained in a State structure.

I gather that the String module is inefficient and that Data.Text would be a better choice.

Is it more efficient to build up a list of Text objects over time and combine them together with a single Data.Text.concat for the final output or to run Data.Text.append for each new string so that I am maintaining a single Text object rather than a list?

As Data.Text.append requires copying both strings each time, my gut feeling is that concat would be much more efficient, but Haskell has surprised me before, so I wanted to check.

Kevin

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Daniel Fischer

unread,

Jun 3, 2010, 11:08:49 AM6/3/10

to haskel...@haskell.org, Kevin Jardine

On Thursday 03 June 2010 16:03:11, Kevin Jardine wrote:
> (I've done a basic Google search on this with no results. Apologies if
> this has been asked before.)
>
> I am coding a web application in which the content is a Unicode string
> built up over multiple functions and maintained in a State structure.
>
> I gather that the String module is inefficient and that Data.Text would
> be a better choice.
>
> Is it more efficient to build up a list of Text objects over time and
> combine them together with a single Data.Text.concat for the final
> output or to run Data.Text.append for each new string so that I am
> maintaining a single Text object rather than a list?
>
> As Data.Text.append requires copying both strings each time, my gut
> feeling is that concat would be much more efficient, but Haskell has
> surprised me before, so I wanted to check.
>
> Kevin
>

I'd say, use Data.Text.Lazy and its 'fromChunks' function if you produce
the string chunkwise. That avoids copying.
Perhaps Data.ByteString[.Lazy].UTF8 is an even better choice than Data.Text
(depends on what you do).

Daniel Fischer

unread,

Jun 3, 2010, 12:47:55 PM6/3/10

to Kevin Jardine, haskel...@haskell.org

On Thursday 03 June 2010 17:26:36, Kevin Jardine wrote:

> --- On Thu, 6/3/10, Daniel Fischer <daniel.i...@web.de> wrote:
> > Perhaps Data.ByteString[.Lazy].UTF8 is an even better
> > choice than Data.Text (depends on what you do).
>

> I thought that I had the differences between the three libraries figured
> out but I guess not now from what you say.
>
> I had thought that String was a simple but memory inefficient model,
> that Text was for, well text, and that bytestrings were for binary data
> (eg. images, audio files and applications that required a true view on
> each text byte).

Well, not necessarily.
String can be quite memory efficient. As a stupid example,

length (replicate 10000000 'a')

will need less memory than the equivalents using ByteString or Text.
Less stupidly, if the String is lazily produced and consumed from head to
last, String is memory efficient. And it's not necessarily much slower than
ByteString or Text.

In fact, String is sometimes faster than Text (cf. e.g.
http://www.haskell.org/pipermail/haskell-cafe/2010-May/078220.html and
following).

When you have to deal with text that is ASCII or latin1 (or some other
encoding with a byte <-> char correspondence), plain ByteStrings are
usually by far the fastest method. But that's of course a severe
restriction.

>
> So why is there a UTF8 implementation for bytestrings? Does that not
> duplicate what Text is trying to do? If so, why the duplication?

I think Data.ByteString.UTF8 predates Data.Text.

> When is each library more appropriate?

Generally, ByteString for binary data or text, when you know it's safe and
you need the speed.
For text, either String or Data.Text may be the better choice.
IIRC, Data.Text uses utf-16 (or some other 16-bit encoding), so if you
receive utf-8 encoded text, Data.ByteString.UTF8 can be the better choice.
I haven't much experience with either Data.Text or Data.ByteString.UTF8, so
I can't say much about their relative merits.

>
> Kevin

Edward Kmett

unread,

Jun 3, 2010, 1:34:26 PM6/3/10

to Kevin Jardine, haskel...@haskell.org

You might also look at Data.Rope from the rope library, which provides an
O(1) append for strict bytestring chunks, and the ability to decode UTF-8
chars from the result.

http://hackage.haskell.org/packages/archive/rope/0.6.1/doc/html/Data-Rope.html

I'd also be happy to work with you if the current API falls short of your
needs.

-Edward Kmett

Bryan O'Sullivan

unread,

Jun 3, 2010, 6:45:00 PM6/3/10

to Daniel Fischer, Kevin Jardine, haskel...@haskell.org

On Thu, Jun 3, 2010 at 9:16 AM, Daniel Fischer <daniel.i...@web.de>wrote:

> String can be quite memory efficient. As a stupid example,
>
> length (replicate 10000000 'a')
>
> will need less memory than the equivalents using ByteString or Text.
>

Actually, this will be fused with Data.Text, and should execute more quickly
and in less space than String.

Ketil Malde

unread,

Jun 4, 2010, 3:08:12 AM6/4/10

to haskel...@haskell.org

Daniel Fischer <daniel.i...@web.de> writes:

>> So why is there a UTF8 implementation for bytestrings? Does that not
>> duplicate what Text is trying to do? If so, why the duplication?

> I think Data.ByteString.UTF8 predates Data.Text.

One difference is that Data.Text uses UTF-16 internally, not UTF-8.

>> When is each library more appropriate?

Much data is overwhelmingly ASCII, but with an option for non-ASCII in
comments, labels, or similar. E.g., for biological sequence data, files
can be large (the human genome is about 3GB) and non-ascii characters
can only occur in sequence headers which constitute a miniscule fraction
of the total data. So I use ByteString for this.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Daniel Fischer

unread,

Jun 4, 2010, 4:30:53 AM6/4/10

to Bryan O'Sullivan, Kevin Jardine, haskel...@haskell.org

Right, forgot about fusion. However, that requires the code to be compiled
with optimisations, I think (well, one should never compile ByteString or
Text code without).

In which case

{-# RULES
"length/replicate" forall n x. length (replicate n x) = max 0 n
#-}

would be at least as good as the Data.Text thing ;)

Or, to be more fair, if you use Data.List.Stream, it should be fused too
and be equally efficient as Data.Text.