[Haskell-cafe] String vs ByteString


Erik de Castro Lopo

Aug 13, 2010, 7:32:39 AM
to haskel...@haskell.org
Hi all,

I'm using Tagsoup to strip data out of some rather large XML files.

Since the files are large I'm using ByteString, but that leads me
to wonder what is the best way to handle clashes between Prelude
functions like putStrLn and the ByteString versions?

Anyone have any suggestions for doing this as neatly as possible?

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Michael Snoyman

Aug 13, 2010, 7:42:41 AM
to haskel...@haskell.org
Just import the ByteString module qualified. In other words:

import qualified Data.ByteString as S

or for lazy bytestrings:

import qualified Data.ByteString.Lazy as L
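For instance (a minimal sketch, assuming the bytestring package; the Char8 variants are used so that pack takes a String):

```haskell
module Main (main) where

import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L

main :: IO ()
main = do
  putStrLn "from the Prelude"         -- Prelude.putStrLn, on String
  S.putStrLn (S.pack "strict bytes")  -- strict ByteString version
  L.putStrLn (L.pack "lazy bytes")    -- lazy ByteString version
```

The unqualified names stay bound to the Prelude, so there is no clash.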

Cheers,
Michael

Johan Tibell

Aug 13, 2010, 7:42:33 AM
to haskel...@haskell.org
Hi Erik,


On Fri, Aug 13, 2010 at 1:32 PM, Erik de Castro Lopo <mle...@mega-nerd.com> wrote:
Since the files are large I'm using ByteString, but that leads me
to wonder what is the best way to handle clashes between Prelude
functions like putStrLn and the ByteString versions?

Anyone have any suggestions for doing this as neatly as possible?

Use qualified imports, like so:

import qualified Data.ByteString as B

main = B.putStrLn $ B.pack "test"

Cheers,
Johan

Michael Snoyman

Aug 13, 2010, 7:47:26 AM
to Johan Tibell, haskel...@haskell.org
If you want to pack a String into a ByteString, you'll need to import Data.ByteString.Char8 instead.
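That is, a version of the example that actually compiles (a sketch, assuming the bytestring package):

```haskell
module Main (main) where

-- Data.ByteString.pack takes [Word8]; the Char8 module's pack
-- takes a String, packing it byte-by-byte.
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = B.putStrLn (B.pack "test")
```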

Michael 

Johan Tibell

Aug 13, 2010, 7:53:20 AM
to Michael Snoyman, haskel...@haskell.org
On Fri, Aug 13, 2010 at 1:47 PM, Michael Snoyman <mic...@snoyman.com> wrote:
Use qualified imports, like so:

import qualified Data.ByteString as B
 
main = B.putStrLn $ B.pack "test"

If you want to pack a String into a ByteString, you'll need to import Data.ByteString.Char8 instead.


Very true. That's what I get for using a random example without testing it first.

-- Johan

Pierre-Etienne Meunier

Aug 13, 2010, 10:03:17 AM
to haskel...@haskell.org
Hi,

Why don't you use the Data.Rope library ?
The asymptotic complexities are way better than those of the ByteString functions.

PE

Johan Tibell

Aug 13, 2010, 10:08:10 AM
to Pierre-Etienne Meunier, haskel...@haskell.org
On Fri, Aug 13, 2010 at 4:03 PM, Pierre-Etienne Meunier <pierreetie...@gmail.com> wrote:
Hi,

Why don't you use the Data.Rope library ?
The asymptotic complexities are way better than those of the ByteString functions.

PE

For some operations. I'd expect it to be a constant factor slower on average though.

-- Johan

Kevin Jardine

Aug 13, 2010, 10:24:43 AM
to haskel...@haskell.org
I'm interested to see this kind of open debate on performance,
especially about libraries that provide widely used data structures
such as strings.

One of the more puzzling aspects of Haskell for newbies is the large
number of libraries that appear to provide similar/duplicate
functionality.

The Haskell Platform deals with this to some extent, but it seems to
me that if there are new libraries that appear to provide performance
boosts over more widely used libraries, it would be best if the new
code gets incorporated into the existing more widely used libraries
rather than creating more code to maintain / choose from.

I think that open debate about performance trade-offs could help
consolidate the libraries.

Kevin

On Aug 13, 4:08 pm, Johan Tibell <johan.tib...@gmail.com> wrote:
> On Fri, Aug 13, 2010 at 4:03 PM, Pierre-Etienne Meunier <pierreetienne.meun...@gmail.com> wrote:
> > Hi,
>
> > Why don't you use the Data.Rope library ?
> > The asymptotic complexities are way better than those of the ByteString
> > functions.
>
> > PE
>
> For some operations. I'd expect it to be a constant factor slower on average
> though.
>
> -- Johan
>


Johan Tibell

Aug 13, 2010, 10:43:59 AM
to Kevin Jardine, haskel...@haskell.org
On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine <kevinj...@gmail.com> wrote:
I'm interested to see this kind of open debate on performance,
especially about libraries that provide widely used data structures
such as strings.

One of the more puzzling aspects of Haskell for newbies is the large
number of libraries that appear to provide similar/duplicate
functionality.

The Haskell Platform deals with this to some extent, but it seems to
me that if there are new libraries that appear to provide performance
boosts over more widely used libraries, it would be best if the new
code gets incorporated into the existing more widely used libraries
rather than creating more code to maintain / choose from.

I think that open debate about performance trade-offs could help
consolidate the libraries.

Kevin

I agree.

Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks, have been well tuned by experienced Haskellers, and should be the fastest and most memory-compact in most cases. There are still a few cases where String beats Text, but they are being worked on as we speak.

Cheers,
Johan

Gábor Lehel

Aug 13, 2010, 11:25:58 AM
to Johan Tibell, Kevin Jardine, haskel...@haskell.org

How about the case for text which is guaranteed to be in ascii/latin1?
ByteString again?


--
Work is punishment for failing to procrastinate effectively.

Sean Leather

Aug 13, 2010, 11:27:32 AM
to Johan Tibell, haskel...@haskell.org
 
Johan Tibell wrote:
Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks, have been well tuned by experienced Haskellers, and should be the fastest and most memory-compact in most cases. There are still a few cases where String beats Text, but they are being worked on as we speak.

Which one do you use for strings in HTML or XML in which UTF-8 has become the commonly accepted standard encoding? It's text, not binary, so I should choose Data.Text. But isn't there a performance penalty for translating from Data.Text's internal 16-bit encoding to UTF-8?

http://tools.ietf.org/html/rfc3629
http://www.utf8.com/

Regards,
Sean

Daniel Fischer

Aug 13, 2010, 11:49:08 AM
to haskel...@haskell.org
On Friday 13 August 2010 17:25:58, Gábor Lehel wrote:
> How about the case for text which is guaranteed to be in ascii/latin1?
> ByteString again?

If you can be sure that that won't change anytime soon, definitely.
Bonus points if you can write the code so that later changing to e.g.
Data.Text requires only a change of imports.
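One way to earn those bonus points (a sketch; `Str` and `process` are invented names): route everything through a single qualified alias and a type synonym, so a migration touches only the top of the file.

```haskell
module Main (main) where

-- To migrate to Data.Text later, only this import and the type
-- synonym should need to change (plus an IO-module import).
import qualified Data.ByteString.Char8 as S

type Str = S.ByteString

-- Example operation, written only against the alias.
process :: Str -> Int
process = S.length . S.filter (== 'a')

main :: IO ()
main = print (process (S.pack "banana"))
```

This works because Data.Text deliberately mirrors much of the ByteString API.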

Daniel Fischer

Aug 13, 2010, 11:52:52 AM
to haskel...@haskell.org
On Friday 13 August 2010 17:27:32, Sean Leather wrote:
> Which one do you use for strings in HTML or XML in which UTF-8 has
> become the commonly accepted standard encoding? It's text, not binary,
> so I should choose Data.Text. But isn't there a performance penalty for
> translating from Data.Text's internal 16-bit encoding to UTF-8?

Yes there is.
Whether using String, Data.Text or Data.ByteString + Data.ByteString.UTF8
is the best choice depends on what you do. Test and then decide.

Bryan O'Sullivan

Aug 13, 2010, 11:57:36 AM
to Gábor Lehel, Kevin Jardine, haskel...@haskell.org
2010/8/13 Gábor Lehel <illi...@gmail.com>

How about the case for text which is guaranteed to be in ascii/latin1?
ByteString again?
 
If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons.
  1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 "ß" (eszett, sharp S), you'll get the two-character sequence "SS", which is correct. Using Char8.map Char.toUpper here gives the wrong answer.
  2. In many cases, the API is easier to use, because it's oriented towards using text data, instead of being a port of the list API.
  3. Some commonly used functions, such as substring searching, are way faster than their ByteString counterparts.
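Point 1 is easy to demonstrate (a sketch, assuming the text and bytestring packages):

```haskell
module Main (main) where

import qualified Data.ByteString.Char8 as B
import qualified Data.Char as Char
import qualified Data.Text as T

main :: IO ()
main = do
  -- Text applies the full case mapping: one ß becomes two S's.
  putStrLn (T.unpack (T.toUpper (T.pack "straße")))
  -- Char8 maps single bytes through Data.Char.toUpper, which can
  -- never turn one byte into the two-character sequence "SS".
  putStrLn (B.unpack (B.map Char.toUpper (B.pack "strasse")))
```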

Donn Cave

Aug 13, 2010, 11:57:43 AM
to haskel...@haskell.org
Quoth Sean Leather <lea...@cs.uu.nl>,

> Which one do you use for strings in HTML or XML in which UTF-8 has become
> the commonly accepted standard encoding? It's text, not binary, so I should
> choose Data.Text. But isn't there a performance penalty for translating from
> Data.Text's internal 16-bit encoding to UTF-8?

Use both?

I am not familiar with Text, but UTF-8 is pretty awkward, and I will
sure look into Text before wasting any time trying to fine-tune my
ByteString handling for UTF-8.

But in practice only a fraction of my data input will be manipulated
in an encoding-sensitive context. I'm thinking _all_ data is binary,
and accordingly all inputs are ByteString; conversion to Text will
happen as needed for ... uh, wait, is there a conversion from
ByteString to Text? Well, if not, no doubt that's coming.

Donn Cave, do...@avvanta.com

Johan Tibell

Aug 13, 2010, 12:11:11 PM
to Bryan O'Sullivan, haskel...@haskell.org, Kevin Jardine
2010/8/13 Bryan O'Sullivan <b...@serpentine.com>
These are all good reasons. An even more important reason is type safety:

A function that receives a Text argument has the guarantee that the input is valid Unicode. A function that receives a ByteString doesn't have that guarantee, and if validity is important the function must perform a validity check before operating on the data. If the function does not validate the input, it might crash or, even worse, write invalid data to disk or some other data store, corrupting the application data.

This is a bit of a subtle point that you really only see once systems get large. Even though you might pay for the conversion from ByteString to Text you might make up for that by avoiding several validity checks down the road.
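A sketch of that boundary pattern (assuming a text version that exports Data.Text.Encoding.decodeUtf8'; `fromWire` and `shout` are invented names):

```haskell
module Main (main) where

import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- Validate untrusted bytes exactly once, at the system boundary.
fromWire :: B.ByteString -> Either String T.Text
fromWire bs = either (Left . show) Right (decodeUtf8' bs)

-- Everything downstream takes Text: valid Unicode by construction,
-- so no function here ever has to re-check its input.
shout :: T.Text -> T.Text
shout = T.toUpper

main :: IO ()
main = print (fmap shout (fromWire (B.pack "hello")))
```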

Cheers,
Johan

Daniel Fischer

Aug 13, 2010, 12:55:49 PM
to haskel...@haskell.org
On Friday 13 August 2010 17:57:36, Bryan O'Sullivan wrote:
> 3. Some commonly used functions, such as substring searching, are
> *way*faster than their ByteString counterparts.

That's an unfortunate example. Using the stringsearch package, substring
searching in ByteStrings was considerably faster than in Data.Text in my
tests.
Replacing substrings blew Data.Text to pieces even, with a factor of 10-65
between ByteString and Text (and much smaller memory footprint).

stringsearch (Data.ByteString.Lazy.Search):

$ ./bmLazy +RTS -s -RTS ../../bigfile Gutenberg Hutzenzwerg > /dev/null
./bmLazy ../../bigfile Gutenberg Hutzenzwerg +RTS -s
92,045,816 bytes allocated in the heap
31,908 bytes copied during GC
103,368 bytes maximum residency (1 sample(s))
39,992 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 158 collections, 0 parallel, 0.01s, 0.00s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed

INIT time 0.00s ( 0.00s elapsed)
MUT time 0.07s ( 0.17s elapsed)
GC time 0.01s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.08s ( 0.17s elapsed)

%GC time 10.5% (2.1% elapsed)

Alloc rate 1,353,535,321 bytes per MUT second

Productivity 89.5% of total user, 40.1% of total elapsed

Data.Text.Lazy:

$ ./textLazy +RTS -s -RTS ../../bigfile Gutenberg Hutzenzwerg > /dev/null
./textLazy ../../bigfile Gutenberg Hutzenzwerg +RTS -s
4,916,133,652 bytes allocated in the heap
6,721,496 bytes copied during GC
12,961,776 bytes maximum residency (58 sample(s))
12,788,968 bytes maximum slop
39 MB total memory in use (1 MB lost due to fragmentation)

Generation 0: 8774 collections, 0 parallel, 0.70s, 0.73s elapsed
Generation 1: 58 collections, 0 parallel, 0.03s, 0.03s elapsed

INIT time 0.00s ( 0.00s elapsed)
MUT time 9.87s ( 10.23s elapsed)
GC time 0.73s ( 0.75s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 10.60s ( 10.99s elapsed)

%GC time 6.9% (6.9% elapsed)

Alloc rate 497,956,181 bytes per MUT second

bigfile is a ~75M file.


The point of the more adequate API for text manipulation stands, of course.

Cheers,
Daniel

Bryan O'Sullivan

Aug 13, 2010, 1:53:37 PM
to Daniel Fischer, haskel...@haskell.org
On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer <daniel.i...@web.de> wrote:

That's an unfortunate example. Using the stringsearch package, substring
searching in ByteStrings was considerably faster than in Data.Text in my
tests.

Interesting. Got a test case so I can repro and fix? :-) 

Kevin Jardine

Aug 13, 2010, 2:16:02 PM
to haskel...@haskell.org
This back and forth on performance is great!

I often see ByteString used where Text is theoretically more
appropriate (eg. the Snap web framework) and it would be good to get
these performance issues ironed out so people feel more comfortable
using the right tool for the job based upon API rather than
performance.

Many other languages have two major formats for strings (binary and
text) and it would be great if performance improvements for ByteString
and Text allowed the same kind of convergence for Haskell.

Kevin

On Aug 13, 7:53 pm, "Bryan O'Sullivan" <b...@serpentine.com> wrote:


> On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer <daniel.is.fisc...@web.de> wrote:
>
>
>
> > That's an unfortunate example. Using the stringsearch package, substring
> > searching in ByteStrings was considerably faster than in Data.Text in my
> > tests.
>
> Interesting. Got a test case so I can repro and fix? :-)
>


Daniel Fischer

Aug 13, 2010, 2:42:05 PM
to Bryan O'Sullivan, haskel...@haskell.org

Sure, use http://norvig.com/big.txt (~6.2M), cat it together a few times to
test on larger files.

ByteString code (bmLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as C


import qualified Data.ByteString.Lazy as L

import Data.ByteString.Lazy.Search

main :: IO ()
main = do
    (file : pat : _) <- getArgs
    let !spat = C.pack pat
        work = indices spat
    L.readFile file >>= print . length . work
----------------------------------------------------------------
Data.Text.Lazy (textLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import Data.Text.Lazy.Search

main :: IO ()
main = do
    (file : pat : _) <- getArgs
    let !spat = T.pack pat
        work = indices spat
    TIO.readFile file >>= print . length . work
----------------------------------------------------------------
(Data.Text.Lazy.Search is of course not exposed by default ;), I use
text-0.7.2.1)

Some local timings:

1. real words in a real text file:

$ time ./textLazy big.txt the
92805
0.59user 0.00system 0:00.61elapsed 97%CPU
$ time ./bmLazy big.txt the
92805
0.02user 0.01system 0:00.04elapsed 104%CPU

$ time ./textLazy big.txt and
43587
0.56user 0.01system 0:00.58elapsed 100%CPU
$ time ./bmLazy big.txt and
43587
0.02user 0.01system 0:00.03elapsed 88%CPU


$ time ./textLazy big.txt mother
317
0.44user 0.01system 0:00.46elapsed 99%CPU
$ time ./bmLazy big.txt mother
317
0.00user 0.01system 0:00.02elapsed 69%CPU


$ time ./textLazy big.txt deteriorate
2
0.37user 0.00system 0:00.38elapsed 98%CPU
$ time ./bmLazy big.txt deteriorate
2
0.01user 0.01system 0:00.02elapsed 114%CPU

$ time ./textLazy big.txt "Project Gutenberg"
177
0.37user 0.00system 0:00.38elapsed 97%CPU
$ time ./bmLazy big.txt "Project Gutenberg"
177
0.00user 0.01system 0:00.01elapsed 100%CPU

2. periodic pattern in a file of 33.4M of aaaaa:

$ time ./bmLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
34999942
1.22user 0.04system 0:01.30elapsed 97%CPU
$ time ./textLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
3.07user 0.03system 0:03.14elapsed 98%CPU

Oh, that's closer, but text doesn't find overlapping matches, well, we can
do that too (replace indices with nonOverlappingIndices):

$ time ./noBMLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
0.18user 0.04system 0:00.23elapsed 97%CPU

Yeah, that's more like it :D

Daniel Fischer

Aug 13, 2010, 2:57:07 PM
to Bryan O'Sullivan, haskel...@haskell.org

Just occurred to me, a lot of the difference is due to the fact that text
has to convert a ByteString to Text on reading the file, so I timed that by
reading the file and counting the chunks, that took text 0.21s for big.txt
vs. Data.ByteString.Lazy's 0.01s.
So for searching in-memory strings, subtract about 0.032s/MB from the
difference - it's still large.

Kevin Jardine

Aug 13, 2010, 3:32:12 PM
to haskel...@haskell.org
Surely a lot of real world text processing programs are IO intensive?
So if there is no native Text IO and everything needs to be read in /
written out as ByteString data converted to/from Text this strikes me
as a major performance sink.

Or is there native Text IO but just not in your example?

Kevin

On Aug 13, 8:57 pm, Daniel Fischer <daniel.is.fisc...@web.de> wrote:
> Just occurred to me, a lot of the difference is due to the fact that text
> has to convert a ByteString to Text on reading the file, so I timed that by
> reading the file and counting the chunks, that took text 0.21s for big.txt
> vs. Data.ByteString.Lazy's 0.01s.
> So for searching in-memory strings, subtract about 0.032s/MB from the
> difference - it's still large.

Daniel Fischer

Aug 13, 2010, 3:52:13 PM
to haskel...@haskell.org, Kevin Jardine
On Friday 13 August 2010 21:32:12, Kevin Jardine wrote:
> Surely a lot of real world text processing programs are IO intensive?
> So if there is no native Text IO and everything needs to be read in /
> written out as ByteString data converted to/from Text this strikes me
> as a major performance sink.
>
> Or is there native Text IO but just not in your example?

Outdated information, sorry.
Up to ghc-6.10, text's IO went via ByteString; that is no longer the case.
However, the native Text IO is (of course) much slower than ByteString IO
due to the need of en/decoding.

>
> Kevin
>
> On Aug 13, 8:57 pm, Daniel Fischer <daniel.is.fisc...@web.de> wrote:
> > Just occurred to me, a lot of the difference is due to the fact that
> > text has to convert a ByteString to Text on reading the file, so I
> > timed that by reading the file and counting the chunks, that took text
> > 0.21s for big.txt vs. Data.ByteString.Lazy's 0.01s.
> > So for searching in-memory strings, subtract about 0.032s/MB from the
> > difference - it's still large.


Ketil Malde

Aug 13, 2010, 4:28:00 PM
to Johan Tibell, Kevin Jardine, haskel...@haskell.org
Johan Tibell <johan....@gmail.com> writes:

> Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> have text, use Data.Text.

If you have a large amount of mostly ASCII text, use ByteString, since
Data.Text uses twice the storage. Also, ByteString might make more
sense if the data is in a byte-oriented encoding, and the cost of
encoding and decoding utf-16 would be significant.
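The storage point is easy to check by encoding the same text both ways (a sketch, assuming the text package's Data.Text.Encoding):

```haskell
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let ascii = T.pack "hello world"
  -- ASCII characters take one byte in UTF-8, two in UTF-16.
  print (B.length (TE.encodeUtf8 ascii))     -- 11 bytes
  print (B.length (TE.encodeUtf16LE ascii))  -- 22 bytes
```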

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Kevin Jardine

Aug 13, 2010, 4:37:14 PM
to haskel...@haskell.org
I find it disturbing that a modern programming language like Haskell
still apparently forces you to choose between a representation for
"mostly ASCII text" and Unicode.

Surely efficient Unicode text should always be the default? And if the
Unicode format used by the Text library is not efficient enough then
can't that be fixed?

Cheers,
Kevin

On Aug 13, 10:28 pm, Ketil Malde <ke...@malde.org> wrote:


> Johan Tibell <johan.tib...@gmail.com> writes:
> > Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> > have text, use Data.Text.
>
> If you have a large amount of mostly ASCII text, use ByteString, since
> Data.Text uses twice the storage.  Also, ByteString might make more
> sense if the data is in a byte-oriented encoding, and the cost of
> encoding and decoding utf-16 would be significant.
>
> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants

Don Stewart

Aug 13, 2010, 4:39:23 PM
to Kevin Jardine, haskel...@haskell.org
There are many libraries for many purposes.

How to pick your string library in Haskell
http://blog.ezyang.com/2010/08/strings-in-haskell/


Edward Z. Yang

Aug 13, 2010, 4:47:16 PM
to Kevin Jardine, haskell-cafe
Excerpts from Kevin Jardine's message of Fri Aug 13 16:37:14 -0400 2010:

> I find it disturbing that a modern programming language like Haskell
> still apparently forces you to choose between a representation for
> "mostly ASCII text" and Unicode.
>
> Surely efficient Unicode text should always be the default? And if the
> Unicode format used by the Text library is not efficient enough then
> can't that be fixed?

For what it's worth, Java uses UTF-16 representation internally for
strings, and thus also wastes space.

There is something to be said for UTF-8 in-memory representation, but
it takes a lot of care. A newtype for dirty and clean UTF-8 may come
in handy.

Cheers,
Edward

Kevin Jardine

Aug 13, 2010, 4:51:34 PM
to haskel...@haskell.org
Hi Don,

With respect, I disagree with that approach.

Almost every modern programming language has one or at most two
standard representations for strings.

That includes PHP, Python, Ruby, Perl and many others. The lack of a
standard text representation in Haskell has created a crazy patchwork
of incompatible libraries requiring explicit and often inefficient
conversions to connect them together.

I expect Haskell to be higher level than those other languages so that
I can ignore the lower level details and focus on the algorithms. But
in fact the string issue forces me to deal with lower level details
than even PHP requires. I end up with a program littered with ugly
pack, unpack, toString, fromString and similar calls.

That just doesn't feel right to me.

Kevin

On Aug 13, 10:39 pm, Don Stewart <d...@galois.com> wrote:
> There are many libraries for many purposes.
>
>     How to pick your string library in Haskell
>    http://blog.ezyang.com/2010/08/strings-in-haskell/
>
> kevinjardine:
>
> > I find it disturbing that a modern programming language like Haskell
> > still apparently forces you to choose between a representation for
> > "mostly ASCII text" and Unicode.
>
> > Surely efficient Unicode text should always be the default? And if the
> > Unicode format used by the Text library is not efficient enough then
> > can't that be fixed?
>
> > Cheers,
> > Kevin
>

> > On Aug 13, 10:28 pm, Ketil Malde <ke...@malde.org> wrote:
> > > Johan Tibell <johan.tib...@gmail.com> writes:
> > > > Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> > > > have text, use Data.Text.
>
> > > If you have a large amount of mostly ASCII text, use ByteString, since

> > > Data.Text uses twice the storage. Also, ByteString might make more


> > > sense if the data is in a byte-oriented encoding, and the cost of
> > > encoding and decoding utf-16 would be significant.
>
> > > -k
> > > --
> > > If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde

Aug 13, 2010, 5:04:25 PM
to Kevin Jardine, haskel...@haskell.org
Kevin Jardine <kevinj...@gmail.com> writes:

> Almost every modern programming language has one or at most two
> standard representations for strings.

> That includes PHP, Python, Ruby, Perl and many others. The lack of a
> standard text representation in Haskell has created a crazy patchwork
> of incompatible libraries requiring explicit and often inefficient
> conversions to connect them together.

Haskell does have a standard representation for strings, namely [Char].
Unfortunately, this sacrifices efficiency for elegance, which gives rise
to the plethora of libraries.

> I end up with a program littered with ugly
> pack, unpack, toString, fromString and similar calls.

Some of this can be avoided using a language extension that let you
overload string constants.

There are always trade offs, and no one solution will fit all: UTF-8 is
space efficient while UTF-16 is time efficient (at least for certain
classes of problems and data). It does seem that it should be possible
to unify the various libraries wrapping bytestrings (CompactString,
ByteString.UTF8 etc), however.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Erik de Castro Lopo

Aug 13, 2010, 6:19:43 PM
to haskel...@haskell.org
Pierre-Etienne Meunier wrote:

> Hi,
>
> Why don't you use the Data.Rope library ?
> The asymptotic complexities are way better than those of the
> ByteString functions.

What I see as my current problem is that there are already two things, String and ByteString, which represent strings. Adding Text and Data.Rope makes that problem worse, not better.

Erik de Castro Lopo

Aug 13, 2010, 6:42:43 PM
to haskel...@haskell.org
Kevin Jardine wrote:

> With respect, I disagree with that approach.
>
> Almost every modern programming language has one or at most two
> standard representations for strings.

I think having two makes sense, one for arrays of arbitrary
binary bytes and one for some unicode data format, preferably
UTF-8.



> That includes PHP, Python, Ruby, Perl and many others. The lack of a
> standard text representation in Haskell has created a crazy patchwork
> of incompatible libraries requiring explicit and often inefficient
> conversions to connect them together.
>
> I expect Haskell to be higher level than those other languages so that
> I can ignore the lower level details and focus on the algorithms. But
> in fact the string issue forces me to deal with lower level details
> than even PHP requires. I end up with a program littered with ugly
> pack, unpack, toString, fromString and similar calls.
>
> That just doesn't feel right to me.

That is what I was trying to say when I started this thread. Thank you.

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

Erik de Castro Lopo

Aug 13, 2010, 6:44:31 PM
to haskel...@haskell.org
Ketil Malde wrote:

> Haskell does have a standard representation for strings, namely [Char].
> Unfortunately, this sacrifices efficiency for elegance, which gives rise
> to the plethora of libraries.

To have the default standard representation be one that works
so poorly for many common everyday tasks such as mangling
large chunks of XML is a large part of the problem.

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

Ivan Lazar Miljenovic

Aug 13, 2010, 7:03:04 PM
to Kevin Jardine, haskel...@haskell.org
Kevin Jardine <kevinj...@gmail.com> writes:

> Hi Don,
>
> With respect, I disagree with that approach.
>
> Almost every modern programming language has one or at most two
> standard representations for strings.

Almost every modern programming language thinks you can whack a print
statement wherever you like... ;-)

> That includes PHP, Python, Ruby, Perl and many others. The lack of a
> standard text representation in Haskell has created a crazy patchwork
> of incompatible libraries requiring explicit and often inefficient
> conversions to connect them together.
>
> I expect Haskell to be higher level than those other languages so that
> I can ignore the lower level details and focus on the algorithms. But
> in fact the string issue forces me to deal with lower level details
> than even PHP requires. I end up with a program littered with ugly
> pack, unpack, toString, fromString and similar calls.

So, the real issue here is that there is not yet a good abstraction over
what we consider to be textual data, and instead people have to code to
a specific data type.

--
Ivan Lazar Miljenovic
Ivan.Mi...@gmail.com
IvanMiljenovic.wordpress.com

Jason Dagit

Aug 13, 2010, 7:35:48 PM
to Ivan Lazar Miljenovic, Kevin Jardine, haskel...@haskell.org
On Fri, Aug 13, 2010 at 4:03 PM, Ivan Lazar Miljenovic <ivan.mi...@gmail.com> wrote:
Kevin Jardine <kevinj...@gmail.com> writes:

> Hi Don,
>
> With respect, I disagree with that approach.
>
> Almost every modern programming language has one or at most two
> standard representations for strings.

Almost every modern programming language thinks you can whack a print
statement wherever you like... ;-)

> That includes PHP, Python, Ruby, Perl and many others. The lack of a
> standard text representation in Haskell has created a crazy patchwork
> of incompatible libraries requiring explicit and often inefficient
> conversions to connect them together.
>
> I expect Haskell to be higher level than those other languages so that
> I can ignore the lower level details and focus on the algorithms. But
> in fact the string issue forces me to deal with lower level details
> than even PHP requires. I end up with a program littered with ugly
> pack, unpack, toString, fromString and similar calls.

So, the real issue here is that there is not yet a good abstraction over
what we consider to be textual data, and instead people have to code to
a specific data type.

Isn't this the same problem we have with numeric literals?  I might even go so far as to suggest it's going to be a problem with all types of literals.

Isn't it also a problem which is partially solved with the OverloadedStrings extension?

It seems like the interface exposed by ByteString could be in a type class.  At that point, would the problem be solved?
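A sketch of that idea; the class and method names here are invented for illustration, and a real design would have to cover far more of the API:

```haskell
module Main (main) where

import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T

-- A hypothetical class abstracting over textual types.
class Textual a where
  tPack   :: String -> a
  tUnpack :: a -> String
  tLength :: a -> Int

instance Textual B.ByteString where
  tPack   = B.pack
  tUnpack = B.unpack
  tLength = B.length

instance Textual T.Text where
  tPack   = T.pack
  tUnpack = T.unpack
  tLength = T.length

-- Code written against the class works with either representation.
main :: IO ()
main = do
  print (tLength (tPack "hello" :: B.ByteString))
  print (tLength (tPack "hello" :: T.Text))
```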

Jason

Bryan O'Sullivan

Aug 13, 2010, 7:41:36 PM
to Daniel Fischer, haskel...@haskell.org
On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer <daniel.i...@web.de> wrote:

That's an unfortunate example. Using the stringsearch package, substring
searching in ByteStrings was considerably faster than in Data.Text in my
tests.

Daniel, thanks again for bringing up this example! It turned out that quite a lot of the difference in performance was due to an inadvertent space leak in the text search code. With a single added bang pattern, the execution time and space usage both improved markedly.

There is of course still lots of room for improvement, but having test cases like this helps immensely.

Ivan Lazar Miljenovic

unread,
Aug 13, 2010, 7:48:24 PM8/13/10
to da...@codersbase.com, Kevin Jardine, haskel...@haskell.org
Jason Dagit <da...@codersbase.com> writes:

> On Fri, Aug 13, 2010 at 4:03 PM, Ivan Lazar Miljenovic <
>>

>> So, the real issue here is that there is not yet a good abstraction over
>> what we consider to be textual data, and instead people have to code to
>> a specific data type.
>>
>
> Isn't this the same problem we have with numeric literals? I might even go
> so far as to suggest it's going to be a problem with all types of
> literals.

Not just literals; there is no common way of doing a character
replacement (e.g. map toUpper) across textual types, for example.

> Isn't it also a problem which is partially solved with the OverloadedStrings
> extension?
> http://haskell.cs.yale.edu/ghc/docs/6.12.2/html/users_guide/type-class-extensions.html#overloaded-strings

That just converts literals; it doesn't provide a common API.
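
For instance, here is a small sketch of what the extension does (GHC with the bytestring package assumed; the names `bs` and `s` are just illustrative):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as B

-- With OverloadedStrings, a string literal desugars to
-- (fromString "hello"), so the same literal syntax can produce
-- any type with an IsString instance -- but only literals; the
-- functions you then call on bs and s still come from two
-- unrelated APIs.
bs :: B.ByteString
bs = "hello"

s :: String
s = "hello"

main :: IO ()
main = print (B.unpack bs == s)
```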

> It seems like the interface exposed by ByteString could be in a type class.
> At that point, would the problem be solved?

To a certain extent, yes.

There is no one typeclass that could cover everything (especially since
something as simple as toUpper won't work if I understand Bryan's ß ->
SS example), but it would help in the majority of cases.

There has been one attempt, but it doesn't seem very popular (tagsoup
has another, but it's meant to be internal only):
http://hackage.haskell.org/packages/archive/ListLike/latest/doc/html/Data-ListLike.html#39
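
As a concrete illustration of what such an abstraction might look like, here is a stripped-down sketch (the class and method names are invented for illustration; this is not ListLike's actual API):

```haskell
{-# LANGUAGE FlexibleInstances #-}
import qualified Data.ByteString.Char8 as B
import Data.Char (toUpper)

-- A hypothetical minimal abstraction over textual types.
class TextLike t where
  tlPack   :: String -> t
  tlUnpack :: t -> String
  tlMap    :: (Char -> Char) -> t -> t

instance TextLike [Char] where
  tlPack   = id
  tlUnpack = id
  tlMap    = map

instance TextLike B.ByteString where
  tlPack   = B.pack
  tlUnpack = B.unpack
  tlMap    = B.map

-- Works uniformly over any TextLike instance.  (Note that a
-- per-Char toUpper is only correct for one-to-one case mappings,
-- which is exactly the ß -> SS caveat mentioned above.)
upcase :: TextLike t => t -> t
upcase = tlMap toUpper

main :: IO ()
main = putStrLn (tlUnpack (upcase ("abc" :: String)))
</imports></test>
```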

Brandon S Allbery KF8NH

unread,
Aug 13, 2010, 8:41:09 PM8/13/10
to haskel...@haskell.org
On 8/13/10 16:37 , Kevin Jardine wrote:
> Surely efficient Unicode text should always be the default? And if the

Efficient for what? The most efficient Unicode representation for
Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16.

--
brandon s. allbery [linux,solaris,freebsd,perl] all...@kf8nh.com
system administrator [openafs,heimdal,too many hats] all...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

Evan Laforge

unread,
Aug 13, 2010, 8:51:46 PM8/13/10
to Brandon S Allbery KF8NH, haskel...@haskell.org
On Fri, Aug 13, 2010 at 6:41 PM, Brandon S Allbery KF8NH
<all...@ece.cmu.edu> wrote:
> On 8/13/10 16:37 , Kevin Jardine wrote:
>> Surely efficient Unicode text should always be the default? And if the
>
> Efficient for what?  The most efficient Unicode representation for
> Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16.

I have an app that is using Data.Text, however I'm thinking of
switching to UTF8 bytestrings. The reasons are that there are two
main things I do with text: pass it to a C API to display, and parse
it. The C API expects UTF8, and the parser libraries with a
reputation for being fast all seem to have bytestring inputs, but not
Data.Text (I'm using unpack -> parsec, which is not optimal).

Dan Doel

unread,
Aug 13, 2010, 9:01:10 PM8/13/10
to haskel...@haskell.org
On Friday 13 August 2010 8:51:46 pm Evan Laforge wrote:
> I have an app that is using Data.Text, however I'm thinking of
> switching to UTF8 bytestrings. The reasons are that there are two
> main things I do with text: pass it to a C API to display, and parse
> it. The C API expects UTF8, and the parser libraries with a
> reputation for being fast all seem to have bytestring inputs, but not
> Data.Text (I'm using unpack -> parsec, which is not optimal).

You should be able to use parsec with text. All you need to do is write a
Stream instance:

  instance Monad m => Stream Text m Char where
      uncons = return . Text.uncons
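
Spelled out fully, parsing Text with parsec then looks like this (a sketch; recent parsec releases already ship an equivalent Stream instance for strict Text, so with a modern toolchain you would not define the instance yourself, while 2010-era parsec needs the instance above plus FlexibleInstances and MultiParamTypeClasses):

```haskell
import Data.Text (Text)
import qualified Data.Text as Text
import Text.Parsec

-- A parser that consumes Text directly, no unpack needed.
word :: Parsec Text () String
word = many1 letter

main :: IO ()
main = print (parse word "<input>" (Text.pack "hello world"))
-- prints: Right "hello"
```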

-- Dan

Felipe Lessa

unread,
Aug 13, 2010, 9:18:08 PM8/13/10
to Dan Doel, haskel...@haskell.org
On Fri, Aug 13, 2010 at 10:01 PM, Dan Doel <dan....@gmail.com> wrote:
> On Friday 13 August 2010 8:51:46 pm Evan Laforge wrote:
>> I have an app that is using Data.Text, however I'm thinking of
>> switching to UTF8 bytestrings.  The reasons are that there are two
>> main things I do with text: pass it to a C API to display, and parse
>> it.  The C API expects UTF8, and the parser libraries with a
>> reputation for being fast all seem to have bytestring inputs, but not
>> Data.Text (I'm using unpack -> parsec, which is not optimal).
>
> You should be able to use parsec with text. All you need to do is write a
> Stream instance:
>
>  instance Monad m => Stream Text m Char where
>    uncons = return . Text.uncons

Then this should be in a 'parsec-text' package. Instances are always
implicitly imported.

Suppose packages A and B define this instance separately. If
package C imports A and B, then it can't use any of those
instances nor define its own.

Cheers! =)

--
Felipe.

Kevin Jardine

unread,
Aug 14, 2010, 1:29:55 AM8/14/10
to haskel...@haskell.org
On Aug 14, 2:41 am, Brandon S Allbery KF8NH <allb...@ece.cmu.edu>
wrote:

> Efficient for what?  The most efficient Unicode representation for
> Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16.

I think that this kind of programming detail should be handled
internally (even if necessary by switching automatically from UTF-8 to
UTF-16 depending upon the language).

I'm using Haskell so that I can write high level code. In my view I
should not have to care if the people using my application write in
Farsi, Quechua or Tamil.

Kevin

Andrew Coppin

unread,
Aug 14, 2010, 5:55:25 AM8/14/10
to haskel...@haskell.org
Johan Tibell wrote:
> On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine <kevinj...@gmail.com
> <mailto:kevinj...@gmail.com>> wrote:
>
> One of the more puzzling aspects of Haskell for newbies is the large
> number of libraries that appear to provide similar/duplicate
> functionality.
>
>
> I agree.

>
> Here's a rule of thumb: If you have binary data, use Data.ByteString.
> If you have text, use Data.Text. Those libraries have benchmarks and
> have been well tuned by experienced Haskelleres and should be the
> fastest and most memory compact in most cases. There are still a few
> cases where String beats Text but they are being worked on as we speak.

Interesting. I've never even heard of Data.Text. When did that come into
existence?

More importantly: How does the average random Haskeller discover that a
package has become available that might be relevant to their work?

Florian Weimer

unread,
Aug 14, 2010, 6:15:22 AM8/14/10
to Bryan O'Sullivan, haskel...@haskell.org, Kevin Jardine
* Bryan O'Sullivan:

> If you know it's text and not binary data you are working with, you should
> still use Data.Text. There are a few good reasons.
>
> 1. The API is more correct. For instance, if you use Text.toUpper on a
> string containing latin1 "ß" (eszett, sharp S), you'll get the
> two-character sequence "SS", which is correct. Using Char8.map Char.toUpper
> here gives the wrong answer.

Data.Text is still incorrect for some scripts:

$ LANG=tr_TR.UTF-8 ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> import Data.Text
Prelude Data.Text> toUpper $ pack "i"
Loading package array-0.3.0.0 ... linking ... done.
Loading package containers-0.3.0.0 ... linking ... done.
Loading package deepseq-1.1.0.0 ... linking ... done.
Loading package bytestring-0.9.1.5 ... linking ... done.
Loading package text-0.7.2.1 ... linking ... done.
"I"
Prelude Data.Text>

Ivan Lazar Miljenovic

unread,
Aug 14, 2010, 6:22:04 AM8/14/10
to Andrew Coppin, haskel...@haskell.org
Andrew Coppin <andrew...@btinternet.com> writes:

> Interesting. I've never even heard of Data.Text. When did that come
> into existence?

The first version hit Hackage in February last year...

> More importantly: How does the average random Haskeller discover that
> a package has become available that might be relevant to their work?

Look on Hackage; subscribe to mailing lists (where package maintainers
should really write announcement emails), etc.

It's rather surprising you haven't heard of text: it is for benchmarking
this that Bryan wrote criterion; there's emails on -cafe and blog posts
that mention it on a semi-regular basis, etc.

Johan Tibell

unread,
Aug 14, 2010, 6:27:01 AM8/14/10
to Florian Weimer, Kevin Jardine, haskel...@haskell.org
On Sat, Aug 14, 2010 at 12:15 PM, Florian Weimer <f...@deneb.enyo.de> wrote:
* Bryan O'Sullivan:

> If you know it's text and not binary data you are working with, you should
> still use Data.Text. There are a few good reasons.
>
>    1. The API is more correct. For instance, if you use Text.toUpper on a
>    string containing latin1 "ß" (eszett, sharp S), you'll get the
>    two-character sequence "SS", which is correct. Using Char8.map Char.toUpper
>    here gives the wrong answer.

Data.Text is still incorrect for some scripts:

$ LANG=tr_TR.UTF-8 ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> import Data.Text
Prelude Data.Text> toUpper $ pack "i"
Loading package array-0.3.0.0 ... linking ... done.
Loading package containers-0.3.0.0 ... linking ... done.
Loading package deepseq-1.1.0.0 ... linking ... done.
Loading package bytestring-0.9.1.5 ... linking ... done.
Loading package text-0.7.2.1 ... linking ... done.
"I"
Prelude Data.Text>

Yes. We need locale support for that one. I think Bryan is planning to add it.

-- Johan
 

Andrew Coppin

unread,
Aug 14, 2010, 6:35:31 AM8/14/10
to haskel...@haskell.org
Ivan Lazar Miljenovic wrote:

> Andrew Coppin <andrew...@btinternet.com> writes:
>
>
>> More importantly: How does the average random Haskeller discover that
>> a package has become available that might be relevant to their work?
>>
>
> Look on Hackage; subscribe to mailing lists (where package maintainers
> should really write announcement emails), etc.
>

OK. I guess I must have missed that one...

> It's rather surprising you haven't heard of text: it is for benchmarking
> this that Bryan wrote criterion; there's emails on -cafe and blog posts
> that mention it on a semi-regular basis, etc.
>

Well, I suppose I don't do a lot of text processing work... If all
you're trying to do is parse commands from an interactive terminal
prompt, [Char] is probably good enough.

(What I do do is process big chunks of binary data - which is what
ByteString is intended for.)

Ivan Lazar Miljenovic

unread,
Aug 14, 2010, 6:55:08 AM8/14/10
to Andrew Coppin, haskel...@haskell.org
Andrew Coppin <andrew...@btinternet.com> writes:

> Well, I suppose I don't do a lot of text processing work... If all
> you're trying to do is parse commands from an interactive terminal
> prompt, [Char] is probably good enough.

Neither do I, yet I've heard of it... ;-)

Brandon S Allbery KF8NH

unread,
Aug 14, 2010, 10:35:13 AM8/14/10
to haskel...@haskell.org
On 8/14/10 01:29 , Kevin Jardine wrote:
> I think that this kind of programming detail should be handled
> internally (even if necessary by switching automatically from UTF-8 to
> UTF-16 depending upon the language).

This is going to carry a heavy speed penalty.

> I'm using Haskell so that I can write high level code. In my view I
> should not have to care if the people using my application write in
> Farsi, Quechua or Tamil.

Ideally yes, but arguably the existing Unicode representations don't allow
this to be done nicely. (Of course, arguably there is no "nice" way to do
it; UTF-16 is the best you can do as a workable generic setting.)

--
brandon s. allbery [linux,solaris,freebsd,perl] all...@kf8nh.com
system administrator [openafs,heimdal,too many hats] all...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

Donn Cave

unread,
Aug 14, 2010, 11:49:02 AM8/14/10
to haskel...@haskell.org
Quoth Brandon S Allbery KF8NH <all...@ece.cmu.edu>,

> On 8/14/10 01:29 , Kevin Jardine wrote:
>> I think that this kind of programming detail should be handled
>> internally (even if necessary by switching automatically from UTF-8 to
>> UTF-16 depending upon the language).

It seems like the right thing, described in the wrong words - wouldn't
it be a more sensible ideal, to simply `switch' depending on the
character encoding?

I mean, to start with, you'd surely wish for some standardization,
so that the difference between UTF-8 and UTF-16 is essentially internal,
while you use the same API indifferently.

Second, a key requirement to effectively work with external data is
support for multiple character encodings. E.g., if Text is internally
UTF-16, it still must be able to input and output UTF-8, and presumably
also UTF-16 where appropriate.

So given full support for _both_ encodings (for example, Text
implementation for `native' UTF-8), and support for input data of
_either_ encoding as encountered at run time ... then the internal
implementation choice should simply follow the external data. For
Chinese inputs you'd be running UTF-16 functions, for French UTF-8.

Donn Cave, do...@avvanta.com

Don Stewart

unread,
Aug 14, 2010, 2:10:27 PM8/14/10
to Andrew Coppin, haskel...@haskell.org
andrewcoppin:

> Interesting. I've never even heard of Data.Text. When did that come into
> existence?
>
> More importantly: How does the average random Haskeller discover that a
> package has become available that might be relevant to their work?

In this case, Data.Text has been announced on this very list several
times:

Text 0.7 announcement
http://www.haskell.org/pipermail/haskell-cafe/2009-December/070866.html

Text 0.5 announcement
http://www.haskell.org/pipermail/haskell-cafe/2009-October/067517.html

Text 0.2 announcement
http://www.haskell.org/pipermail/haskell-cafe/2009-May/061800.html

Text 0.1 announcement
http://www.haskell.org/pipermail/haskell-cafe/2009-February/056723.html

As well as on Planet Haskell several times:

Finally! Fast Unicode support for Haskell
http://www.serpentine.com/blog/2009/02/27/finally-fast-unicode-support-for-haskell/

Streaming Unicode support for Haskell: text 0.2
http://www.serpentine.com/blog/2009/05/22/streaming-unicode-support-for-haskell-text-02/

Case conversion and text 0.3
http://www.serpentine.com/blog/2009/06/07/case-conversion-and-text-03/

As well as being presented at Anglo Haskell

http://www.wellquite.org/non-blog/AngloHaskell2008/tom%20harper.pdf

It is mentioned repeatedly in the quarterly Hackage status posts:

"vector and text are quickly rising as the preferred arrays and unicode libraries"
http://donsbot.wordpress.com/2010/04/03/the-haskell-platform-q1-2010-report/

"text has made it into the top 30 libraries"
http://donsbot.wordpress.com/2010/06/30/popular-haskell-packages-q2-2010-report/

Ranked 31st most popular package by June 2010.
http://code.haskell.org/~dons/hackage/Jun-2010/popular.txt

Ranked 41st most popular package by April 2010.
http://www.galois.com/~dons/hackage/april-2010/popularity.csv

Ranked 345th by August 2009
http://www.galois.com/~dons/hackage/august-2009/popularity-august-2009.html

And discussed on Reddit Haskell many times:

http://www.reddit.com/r/haskell/comments/8qfvw/doing_unicode_case_conversion_and_error_recovery/

http://www.reddit.com/r/haskell/comments/80smp/datatext_fast_unicode_bytestrings_with_stream/

http://www.reddit.com/r/haskell/comments/ade08/the_performance_of_datatext/

So, to stay up to date without drowning in data, do one of:

* Pay attention to Haskell Cafe announcements
* Follow the Reddit Haskell news.
* Read the quarterly reports on Hackage
* Follow Planet Haskell

-- Don

David Menendez

unread,
Aug 14, 2010, 3:26:04 PM8/14/10
to Johan Tibell, Kevin Jardine, haskel...@haskell.org
On Fri, Aug 13, 2010 at 10:43 AM, Johan Tibell <johan....@gmail.com> wrote:
>
> Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> have text, use Data.Text. Those libraries have benchmarks and have been well
> tuned by experienced Haskelleres and should be the fastest and most memory
> compact in most cases. There are still a few cases where String beats Text
> but they are being worked on as we speak.

It's a good rule, but I don't know how helpful it is to someone doing
XML processing. From what I can tell, the only XML library that uses
Data.Text is libxml-sax, although tagsoup can probably be easily
extended to use it. HXT, HaXml, and xml all use [Char] internally.

--
Dave Menendez <da...@zednenem.com>
<http://www.eyrie.org/~zednenem/>

Yitzchak Gale

unread,
Aug 14, 2010, 6:00:23 PM8/14/10
to Sean Leather, haskel...@haskell.org
Sean Leather wrote:
> Which one do you use for strings in HTML or XML in which UTF-8 has become
> the commonly accepted standard encoding?

UTF-8 is only becoming the standard for non-CJK languages.
We are told by members of our community in CJK countries
that UTF-8 is not widely adopted there, and there is no sign that
it ever will be. And one should be aware that the proportion of
CJK in global Internet traffic is growing quickly.

But of course, that is still a legitimate question for some
situations in which full internationalization will not be needed.

Regards,
Yitz

Sean Leather

unread,
Aug 14, 2010, 6:46:28 PM8/14/10
to Yitzchak Gale, haskel...@haskell.org

Yitzchak Gale wrote:
Sean Leather wrote:
> Which one do you use for strings in HTML or XML in which UTF-8 has become
> the commonly accepted standard encoding?

UTF-8 is only becoming the standard for non-CJK languages.
We are told by members of our community in CJK countries
that UTF-8 is not widely adopted there, and there is no sign that
it ever will be. And one should be aware that the proportion of
CJK in global Internet traffic is growing quickly.

So then, what is the standard? Not being familiar with this area, I googled a bit, and I don't see a consensus. But I also noticeably don't see UTF-16. So, if this is the case, then a similar question still arises for CJK text: what format/library should one use for it (assuming one doesn't want a performance penalty for translating between Data.Text's internal format and the target format)? It appears that there are no ideal answers to such questions.

Regards,
Sean

Yitzchak Gale

unread,
Aug 14, 2010, 7:28:31 PM8/14/10
to Sean Leather, haskel...@haskell.org
Sean Leather wrote:
> So then, what is the standard?
> ...I also noticeably don't see UTF-16.

Right, there are a handful of language-specific 16-bit encodings
that are popular, from what I understand.

> So, if this is the case, then a similar question still arises for CJK text:
> What format/library to use for it (assuming one doesn't want a performance
> penalty for translating between Data.Text's internal format and the target
> format)? It appears that there are no ideal answers to such questions.

Right. If you know you'll be in a specific encoding - whether UTF-8,
Latin1, one of the CJK encodings, or whatever, it might sometimes
make sense to skip Data.Text and do the IO as raw bytes using
ByteString and then encode/decode manually only when needed.
Otherwise, Data.Text is probably the way to go.

Bryan O'Sullivan

unread,
Aug 14, 2010, 7:38:58 PM8/14/10
to Sean Leather, haskel...@haskell.org
On Sat, Aug 14, 2010 at 3:46 PM, Sean Leather <lea...@cs.uu.nl> wrote:

So then, what is the standard?

There isn't one. There are many national standards:
  • China: GB-2312, GBK and GB18030
  • Taiwan: Big5
  • Japan: JIS and Shift-JIS (0208 and 0213 variants) and EUC-JP
  • Korea: KS-X-2001, EUC-KR, and ISO-2022-KR
In general, Unicode uptake is increasing rapidly: http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

Being not familiar with this area, I googled a bit, and I don't see a consensus. But I also noticeably don't see UTF-16. So, if this is the case, then a similar question still arises for CJK text: What format/library to use for it (assuming one doesn't want a performance penalty for translating between Data.Text's internal format and the target format)?

In my opinion, this "performance penalty" hand-wringing is mostly silly. We're talking a pretty small factor of performance difference in most of these cases. Even the biggest difference, between ByteString and String, is usually much less than a factor of 100.

Your absolute first concern should be correctness, for which you should (a) use text and (b) assume that any performance issues are being actively worked on, especially if you report concrete problems and how to reproduce them. In the unlikely event that you need to support non-Unicode encodings, they are readily available via text-icu.

The only significant change to the text API that lies ahead is an introduction of locale support in a few critical places, so that we can do the right thing for languages like Turkish.

John Millikin

unread,
Aug 14, 2010, 8:11:48 PM8/14/10
to Bryan O'Sullivan, haskel...@haskell.org
On Sat, Aug 14, 2010 at 16:38, Bryan O'Sullivan <b...@serpentine.com> wrote:
> In my opinion, this "performance penalty" hand-wringing is mostly silly.
> We're talking a pretty small factor of performance difference in most of these
> cases. Even the biggest difference, between ByteString and String, is usually
> much less than a factor of 100.

This attitude towards performance, that it doesn't really matter as
long as something happens *eventually*, is what pushed me away from
Python and towards more performant languages like Haskell in the first
place. Sure, you might not notice a few extra seconds when parsing
some file on your quad-core developer desktop, but those seconds turn
into 20 minutes of lost battery power when running on smaller systems.
Having to convert the internal data structure between [Char], (Ptr
Word16), and (Ptr Word8) can quickly cause user-visible problems.

Libraries which will (by their nature) see heavy use, such as
"bytestring" and "text", ought to have much attention paid to their
performance characteristics. A factor of 2-3x might be the difference
between being able to use a library, and having to rewrite its
functionality to be more efficient.

> In the unlikely event that you need to support non-Unicode encodings,
> they are readily available via text-icu.

Unfortunately, text-icu is hardcoded to use libicu 4.0, which was
released well over a year ago and is no longer available in many
distributions. I sent you a patch to support newer versions a few
months ago, but never received a response. Meanwhile, libicu is up to
4.4 by now.

Yitzchak Gale

unread,
Aug 14, 2010, 8:39:10 PM8/14/10
to Bryan O'Sullivan, haskel...@haskell.org
Bryan O'Sullivan wrote:
> In general, Unicode uptake is increasing rapidly:
http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

These Google graphs are the oft-quoted source of
Unicode's growing dominance. But the data for those graphs
is taken from Google's own web indexing. Google is a
U.S. company that appears to have a strong Western
culture bias - viz. their recent high-profile struggles with
China. Google is far from being the dominant market
leader in CJK countries that they are in Western countries.
Their level of understanding of those markets is clearly not
the same.

It could be this really is true for CJK countries as well,
or it could be that the data is skewed by Google's web
indexing methods. I won't believe that source until it is
highly corroborated with data and opinions that are native
to CJK countries, from sources that do not have a vested
interest in Unicode adoption.

What we have heard in the past from members of our own
community in CJK countries does not agree at all with
Google's claims, but that may be changing. It would be
great to hear more from them.

Bryan O'Sullivan

unread,
Aug 14, 2010, 9:05:13 PM8/14/10
to John Millikin, haskel...@haskell.org
On Sat, Aug 14, 2010 at 5:11 PM, John Millikin <jmil...@gmail.com> wrote:

This attitude towards performance, that it doesn't really matter as
long as something happens *eventually*, is what pushed me away from
Python and towards more performant languages like Haskell in the first
place.

But wait, wait - I'm not at all contending that performance doesn't matter! In fact, I spent a couple of months working on criterion precisely because I want to base my own performance work on extremely solid data, and to afford the same opportunity to other people. So far in this thread, there's been exactly one performance number posted, by Daniel. Not only have I already thanked him for it, I immediately used (and continue to use) it to improve the performance of the text library in that instance.

More broadly, what I am recommending is simple:
  • Use a good library.
  • Expect good performance out of it.
  • Measure the performance you get out of your application.
  • If it's not good enough, and the fault lies in a library you chose, report a bug and provide a test case.
In the case of the text library, it is often (but not always) competitive with bytestring, and I improve it when I can, especially when given test cases. My goal is for it to be the obvious choice on several fronts:
  • Cleanliness of API, where it's already better, but could still improve
  • Performance, which is not quite where I want it (target: parity with, or better than, bytestring)
  • Quality, where text has slightly more test coverage than bytestring
However, just text alone is a big project, and I could get a lot more done if I was both coding and integrating patches than if coding alone :-) So patches are very welcome.

> In the unlikely event that you need to support non-Unicode encodings,
> they are readily available via text-icu.

Unfortunately, text-icu is hardcoded to use libicu 4.0, which was
released well over a year ago and is no longer available in many
distributions. I sent you a patch to support newer versions a few
months ago, but never received a response.

Yes, that's quite embarrassing, and I am quite apologetic about it, especially since I just asked for help in the preceding paragraph. If it's any help, there's a story behind my apparent sloth: I overenthusiastically accepted a patch from another contributor a few months before yours, and his changes left the text-icu darcs repo in a mess from which I have yet to rescue it. I do still have your patch, and I'll probably abandon my attempts to clean up the other one, as cleaning it up was more work than it was worth.

Bryan O'Sullivan

unread,
Aug 14, 2010, 9:07:39 PM8/14/10
to ga...@sefer.org, haskel...@haskell.org
On Sat, Aug 14, 2010 at 5:39 PM, Yitzchak Gale <ga...@sefer.org> wrote:

It could be this really is true for CJK countries as well,
or it could be that the data is skewed by Google's web
indexing methods.

I also wouldn't be surprised if the picture for web-based text is quite different from that for other textual data.

wren ng thornton

unread,
Aug 14, 2010, 9:53:29 PM8/14/10
to haskel...@haskell.org
Yitzchak Gale wrote:
> Bryan O'Sullivan wrote:
>> In general, Unicode uptake is increasing rapidly:
>> http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html
>
> These Google graphs are the oft-quoted source of
> Unicode's growing dominance. But the data for those graphs
> is taken from Google's own web indexing.

Note also that all those encodings near the bottom are remaining
relatively constant. UTF8 is taking its market share from ASCII and
Western European encodings, not so much from other encodings (as yet).

As Bryan mentioned, Unicode doesn't have wide acceptance in CJK
countries. These days, Japanese websites seem to have finally started to
standardize--- in that they use HTTP/HTML headers to say which encoding
the pages are in (and generally use JIS or Shift-JIS). This is a big
step up from a decade ago when non-commercial sites pretty invariably
required fiddling with the browser to get rid of mojibake. Japan hasn't
been bitten by the i18n/l10n bug and they don't have a strong F/OSS
community to drive adoption either.

--
Live well,
~wren

Donn Cave

unread,
Aug 15, 2010, 1:07:07 AM8/15/10
to haskel...@haskell.org
Quoth "Bryan O'Sullivan" <b...@serpentine.com>,

> In the case of the text library, it is often (but not always) competitive
> with bytestring, and I improve it when I can, especially when given test
> cases. My goal is for it to be the obvious choice on several fronts:
>

> - Cleanliness of API, where it's already better, but could still improve
> - Performance, which is not quite where I want it (target: parity with,
> or better than, bytestring)
> - Quality, where text has slightly more test coverage than bytestring

That sounds great, and I'm looking forward to using Text in my
application - at least, where I think it would help with respect
to correctness. I can't imagine I would unpack all my data right
off the socket, or disk, and use Text throughout my application,
because I'm skeptical that unpacking megabytes of data from 8 to
16 bits can be done without noticeable impact on resources. I
wouldn't imagine I would be filing a bug report on that, because
it's a given - if I have a big data load, obviously I should be
using ByteString.

Am I confused about this? It's why I can't see Text ever being
simply the obvious choice. [Char] will continue to be the obvious
choice if you want a functional data type that supports pattern
matching etc. ByteString will continue to be the obvious choice
for big data loads. We'll have a three way choice between programming
elegance, correctness and efficiency. If Haskell were more than
just a research language, this might be its most prominent open
sore, don't you think?

Donn Cave, do...@avvanta.com

John Millikin

unread,
Aug 15, 2010, 1:32:51 AM8/15/10
to Donn Cave, haskel...@haskell.org
On Sat, Aug 14, 2010 at 22:07, Donn Cave <do...@avvanta.com> wrote:
> Am I confused about this?  It's why I can't see Text ever being
> simply the obvious choice.  [Char] will continue to be the obvious
> choice if you want a functional data type that supports pattern
> matching etc.  ByteString will continue to be the obvious choice
> for big data loads.  We'll have a three way choice between programming
> elegance, correctness and efficiency.  If Haskell were more than
> just a research language, this might be its most prominent open
> sore, don't you think?

I don't see why [Char] is "obvious" -- you'd never use [Word8] for
storing binary data, right? [Char] is popular because it's the default
type for string literals, and due to simple inertia, but when there's
a type based on packed arrays there's no reason to use the list
representation.

Also, despite the name, ByteString and Text are for separate purposes.
ByteString is an efficient [Word8], Text is an efficient [Char] -- use
ByteString for binary data, and Text for...text. Most mature languages
have both types, though the choice of UTF-16 for Text is unusual.
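
The parallel between the two is easy to see from their pack functions (a small sketch; the bytestring and text packages are assumed):

```haskell
import qualified Data.ByteString as B   -- a packed [Word8]
import qualified Data.Text as T         -- a packed [Char]

-- B.pack :: [Word8] -> ByteString: raw bytes, no text semantics.
bytes :: B.ByteString
bytes = B.pack [104, 105]

-- T.pack :: String -> Text: Unicode code points, not bytes.
txt :: T.Text
txt = T.pack "hi"

main :: IO ()
main = print (B.unpack bytes, T.unpack txt)
```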

Edward Z. Yang

unread,
Aug 15, 2010, 1:39:18 AM8/15/10
to John Millikin, haskell-cafe
Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:

> Also, despite the name, ByteString and Text are for separate purposes.
> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
> ByteString for binary data, and Text for...text. Most mature languages
> have both types, though the choice of UTF-16 for Text is unusual.

Given that Python, .NET, Java, and Windows all use UTF-16 for their Unicode
text representations, I cannot really agree with "unusual". :-)

Cheers,
Edward

Michael Snoyman

unread,
Aug 15, 2010, 1:46:12 AM8/15/10
to Edward Z. Yang, haskell-cafe
On Sun, Aug 15, 2010 at 8:39 AM, Edward Z. Yang <ezy...@mit.edu> wrote:
Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
> Also, despite the name, ByteString and Text are for separate purposes.
> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
> ByteString for binary data, and Text for...text. Most mature languages
> have both types, though the choice of UTF-16 for Text is unusual.

Given that Python, .NET, Java, and Windows all use UTF-16 for their Unicode
text representations, I cannot really agree with "unusual". :-)

When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose.

Remember, Python, .NET and Java are all imperative languages without referential transparency. I doubt saying they do something some way will influence most Haskell coders much ;).

Michael

John Millikin

unread,
Aug 15, 2010, 1:54:32 AM8/15/10
to Edward Z. Yang, haskell-cafe
On Sat, Aug 14, 2010 at 22:39, Edward Z. Yang <ezy...@mit.edu> wrote:
> Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
>> Also, despite the name, ByteString and Text are for separate purposes.
>> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
>> ByteString for binary data, and Text for...text. Most mature languages
>> have both types, though the choice of UTF-16 for Text is unusual.
>
> Given that Python, .NET, Java, and Windows all use UTF-16 for their Unicode
> text representations, I cannot really agree with "unusual". :-)

Python doesn't use UTF-16; on UNIX systems it uses UCS-4, and on
WIndows it uses UCS-2. The difference is important because:

Python: len("\U0001dd1e") == 2
Haskell: length (pack "\x0001dd1e")

Java, .NET, Windows, JavaScript, and some other languages use UTF-16
because when Unicode support was added to these systems, the astral
characters had not been invented yet, and 16 bits was enough for the
entire Unicode character set. They originally used UCS-2, but then
moved to UTF-16 to minimize incompatibilities.
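As a concrete sketch of the mechanism (illustrative code, not part of any of the libraries under discussion), the surrogate-pair arithmetic that UTF-16 uses for astral code points looks like this in plain Haskell:

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)

-- Split an astral code point (>= U+10000) into the high and low
-- UTF-16 surrogates that encode it: a high surrogate in
-- D800-DBFF and a low surrogate in DC00-DFFF.
surrogatePair :: Char -> (Int, Int)
surrogatePair c =
  let v = ord c - 0x10000
  in (0xD800 + (v `shiftR` 10), 0xDC00 + (v .&. 0x3FF))

main :: IO ()
main = print (surrogatePair '\x1DD1E')
```

This is why a UCS-2 system sees two "characters" where there is only one code point.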

Anything based on UNIX generally uses UTF-8, because Unicode support
was added later after the problems of UCS-2/UTF-16 had been
discovered. C libraries written by UNIX users use UTF-8 almost
exclusively -- this includes most language bindings available on
Hackage.

I don't mean that UTF-16 is itself unusual, but it's a legacy encoding
-- there's no reason to use it in new projects. If "text" had been
started 15 years ago, I could understand, but since it's still in
active development the use of UTF-16 simply adds baggage.

John Millikin

unread,
Aug 15, 2010, 1:57:05 AM8/15/10
to haskell-cafe
On Sat, Aug 14, 2010 at 22:54, John Millikin <jmil...@gmail.com> wrote:
> Haskell: length (pack "\x0001dd1e")

Apologies -- this line ought to be:

Haskell: Data.Text.length (Data.Text.pack "\x0001dd1e") == 1
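A self-contained way to check the distinction (assuming the text package shipped with GHC is available):

```haskell
import qualified Data.Text as T

-- One astral character: a single code point, so both String and
-- Text report length 1, even though UTF-16 stores it internally
-- as two 16-bit code units.
astral :: String
astral = "\x1DD1E"

main :: IO ()
main = print (length astral, T.length (T.pack astral))
```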

Bryan O'Sullivan

unread,
Aug 15, 2010, 2:34:58 AM8/15/10
to Michael Snoyman, haskell-cafe
On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman <mic...@snoyman.com> wrote:

When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose.

Bear in mind that much of the data you're working with can't be readily trusted. UTF-8 coming from the filesystem, the network, and often the database may not be valid. The cost of validating it isn't all that different from the cost of converting it to UTF-16.

And of course the internals of Data.Text are all fusion-based, so much of the time you're not going to be allocating UTF-16 arrays at all, but instead creating a pipeline of characters that are manipulated in a tight loop. This eliminates a lot of the additional copying that bytestring has to do, for instance.
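As a rough sketch of the kind of pipeline fusion targets (illustrative only; no claim is made here about which of these exact functions fuse in a given text release):

```haskell
import qualified Data.Text as T

-- A pipeline of Text transformations; with stream fusion the
-- intermediate values between the stages need not be
-- materialized as full UTF-16 arrays.
normalize :: T.Text -> T.Text
normalize = T.toUpper . T.filter (/= '-') . T.strip

main :: IO ()
main = print (normalize (T.pack "  spam-and-eggs  "))
```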

To give you an idea of how competitive Data.Text can be compared to C code, this is the system's wc command counting UTF-8 characters in a modestly large file:

$ time wc -m huge.txt 
32443330
real 0.728s

This is Data.Text performing the same task:

$ time ./FileRead text huge.txt 
32443330
real 0.697s


Donn Cave

unread,
Aug 15, 2010, 2:46:43 AM8/15/10
to haskel...@haskell.org
Quoth John Millikin <jmil...@gmail.com>,

> I don't see why [Char] is "obvious" -- you'd never use [Word8] for
> storing binary data, right? [Char] is popular because it's the default
> type for string literals, and due to simple inertia, but when there's
> a type based on packed arrays there's no reason to use the list
> representation.

Well, yes, string literals - and pattern matching support, maybe
that's the same thing. And I think it's fair to say that [Char]
is a natural, elegant match for the language, I mean it leverages
your basic Haskell skills if for example you want to parse something
fairly simple. So even if ByteString weren't the monumental hassle
it is today for simple stuff, String would have at least a little appeal.
And if packed arrays really always mattered, [Char] would be long gone.
They don't, you can do a lot of stuff with [Char] before it turns into
a problem.

> Also, despite the name, ByteString and Text are for separate purposes.
> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
> ByteString for binary data, and Text for...text. Most mature languages
> have both types, though the choice of UTF-16 for Text is unusual.

Maybe most mature languages have one or more extra string types
hacked on to support wide characters. I don't think it's necessarily
a virtue. ByteString vs. ByteString.Char8, where you can choose
more or less indiscriminately to treat the data as Char or Word8,
seems to me like a more useful way to approach the problem. (Of
course, ByteString.Char8 isn't a good way to deal with wide characters
correctly, I'm just saying that's where I'd like to find the answer,
not in some internal character encoding into which all "text" data
must be converted.)

Donn Cave, do...@avvanta.com

Bryan O'Sullivan

unread,
Aug 15, 2010, 3:01:16 AM8/15/10
to Donn Cave, haskel...@haskell.org
On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave <do...@avvanta.com> wrote:
 
Am I confused about this?  It's why I can't see Text ever being
simply the obvious choice.  [Char] will continue to be the obvious
choice if you want a functional data type that supports pattern
matching etc.

Actually, with view patterns, Text is pretty nice to pattern match against:

foo (uncons -> Just (c,cs)) = "whee"

despam (prefixed "spam" -> Just suffix) = "whee" `mappend` suffix
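Spelled out as a compilable sketch -- note that `prefixed` above is not an actual Data.Text export; `stripPrefix` is the closest real function, so this sketch substitutes it:

```haskell
{-# LANGUAGE ViewPatterns #-}
import qualified Data.Text as T

-- Match on the first character of a Text via a view pattern.
firstChar :: T.Text -> Maybe Char
firstChar (T.uncons -> Just (c, _)) = Just c
firstChar _                         = Nothing

-- Strip a "spam" prefix, in the spirit of the despam example above.
despam :: T.Text -> T.Text
despam (T.stripPrefix (T.pack "spam") -> Just suffix) =
  T.append (T.pack "whee") suffix
despam t = t

main :: IO ()
main = print (despam (T.pack "spamalot"))
```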

ByteString will continue to be the obvious choice
for big data loads.

Don't confuse "I have big data" with "I need bytes". If you are working with bytes, use bytestring. If you are working with text, outside of a few narrow domains you should use text.

 We'll have a three way choice between programming
elegance, correctness and efficiency.  If Haskell were more than
just a research language, this might be its most prominent open
sore, don't you think?

No, that's just FUD. 

Colin Paul Adams

unread,
Aug 15, 2010, 3:34:54 AM8/15/10
to Bryan O'Sullivan, haskell-cafe
>>>>> "Bryan" == Bryan O'Sullivan <b...@serpentine.com> writes:

Bryan> On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman <mic...@snoyman.com> wrote:
Bryan> When I'm writing a web app, my code is sitting on a Linux
Bryan> system where the default encoding is UTF-8, communicating
Bryan> with a database speaking UTF-8, receiving request bodies in
Bryan> UTF-8 and sending response bodies in UTF-8. So converting all
Bryan> of that data to UTF-16, just to be converted right back to
Bryan> UTF-8, does seem strange for that purpose.


Bryan> Bear in mind that much of the data you're working with can't
Bryan> be readily trusted. UTF-8 coming from the filesystem, the
Bryan> network, and often the database may not be valid. The cost of
Bryan> validating it isn't all that different from the cost of
Bryan> converting it to UTF-16.

But UTF-16 (apart from being an abomination for creating a hole in the
codepoint space and making it impossible to ever extend it) is slow to
process compared with UTF-32 - you can't get the nth character in
constant time, so it seems an odd choice to me.
--
Colin Adams
Preston Lancashire
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments

Johan Tibell

unread,
Aug 15, 2010, 3:40:02 AM8/15/10
to Colin Paul Adams, haskell-cafe
Hi Colin,

On Sun, Aug 15, 2010 at 9:34 AM, Colin Paul Adams <co...@colina.demon.co.uk> wrote:
But UTF-16 (apart from being an abomination for creating a hole in the
codepoint space and making it impossible to ever extend it) is slow to
process compared with UTF-32 - you can't get the nth character in
constant time, so it seems an odd choice to me.

Aside: Getting the nth character isn't very useful when working with Unicode text:

* Most text processing is linear.
* What we consider a character and what Unicode considers a character differs a bit e.g. since Unicode uses combining characters.
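A base-only illustration of the second point:

```haskell
-- LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT:
-- one user-perceived character, but two code points, so the
-- per-Char length is 2.
eAcute :: String
eAcute = "e\x0301"

main :: IO ()
main = print (length eAcute)
```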

Cheers,
Johan

Ivan Lazar Miljenovic

unread,
Aug 15, 2010, 3:45:33 AM8/15/10
to Don Stewart, haskel...@haskell.org
Don Stewart <do...@galois.com> writes:

> * Pay attention to Haskell Cafe announcements
> * Follow the Reddit Haskell news.
> * Read the quarterly reports on Hackage
> * Follow Planet Haskell

And yet there are still many packages that fall under the radar with no
announcements of any kind on initial release or even new versions :(

Vo Minh Thu

unread,
Aug 15, 2010, 5:55:32 AM8/15/10
to Ivan Lazar Miljenovic, haskel...@haskell.org
2010/8/15 Ivan Lazar Miljenovic <ivan.mi...@gmail.com>:

> Don Stewart <do...@galois.com> writes:
>
>>     * Pay attention to Haskell Cafe announcements
>>     * Follow the Reddit Haskell news.
>>     * Read the quarterly reports on Hackage
>>     * Follow Planet Haskell
>
> And yet there are still many packages that fall under the radar with no
> announcements of any kind on initial release or even new versions :(

If you're interested in a comprehensive update list, you can follow
Hackage on Twitter, or the news feed.

Cheers,
Thu

Andrew Coppin

unread,
Aug 15, 2010, 6:08:26 AM8/15/10
to haskel...@haskell.org
Don Stewart wrote:
> So, to stay up to date, but without drowning in data. Do one of:
>
> * Pay attention to Haskell Cafe announcements
> * Follow the Reddit Haskell news.
> * Read the quarterly reports on Hackage
> * Follow Planet Haskell
>

Interesting. Obviously I look at Haskell Cafe from time to time
(although there's usually far too much traffic to follow it all). I
wasn't aware of *any* of the other resources listed.

Ivan Lazar Miljenovic

unread,
Aug 15, 2010, 6:17:55 AM8/15/10
to Vo Minh Thu, haskel...@haskell.org
Vo Minh Thu <not...@gmail.com> writes:

> 2010/8/15 Ivan Lazar Miljenovic <ivan.mi...@gmail.com>:
>> Don Stewart <do...@galois.com> writes:
>>
>>>     * Pay attention to Haskell Cafe announcements
>>>     * Follow the Reddit Haskell news.
>>>     * Read the quarterly reports on Hackage
>>>     * Follow Planet Haskell
>>
>> And yet there are still many packages that fall under the radar with no
>> announcements of any kind on initial release or even new versions :(
>
> If you're interested in a comprehensive update list, you can follow
> Hackage on Twitter, or the news feed.

Except that that doesn't tell you:

* The purpose of the library
* How a release differs from a previous one
* Why you should use it, etc.

Furthermore, several interesting discussions have arisen out of
announcement emails.

Vo Minh Thu

unread,
Aug 15, 2010, 7:02:14 AM8/15/10
to Ivan Lazar Miljenovic, haskel...@haskell.org
2010/8/15 Ivan Lazar Miljenovic <ivan.mi...@gmail.com>:
> Vo Minh Thu <not...@gmail.com> writes:
>
>> 2010/8/15 Ivan Lazar Miljenovic <ivan.mi...@gmail.com>:
>>> Don Stewart <do...@galois.com> writes:
>>>
>>>>     * Pay attention to Haskell Cafe announcements
>>>>     * Follow the Reddit Haskell news.
>>>>     * Read the quarterly reports on Hackage
>>>>     * Follow Planet Haskell
>>>
>>> And yet there are still many packages that fall under the radar with no
>>> announcements of any kind on initial release or even new versions :(
>>
>> If you're interested in a comprehensive update list, you can follow
>> Hackage on Twitter, or the news feed.
>
> Except that that doesn't tell you:
>
> * The purpose of the library
> * How a release differs from a previous one
> * Why you should use it, etc.
>
> Furthermore, several interesting discussions have arisen out of
> announcement emails.

Sure, nor does it write a book chapter about some practical usage. I
mean (tongue in cheek) that neither the other resources, nor even a
proper announcement, provide all that.

I still remember the UHC announcement (a (nearly) complete Haskell 98
compiler) thread where most of it was about lack of support for n+k
pattern.

But the bullet list above was to point Andrew to a few places where he
could have learned about Text.

Cheers,
Thu

Brandon S Allbery KF8NH

unread,
Aug 15, 2010, 11:17:12 AM8/15/10
to haskel...@haskell.org

More to the point, there's nothing elegant about [Char] --- its sole
"advantage" is requiring no thought.

--
brandon s. allbery [linux,solaris,freebsd,perl] all...@kf8nh.com
system administrator [openafs,heimdal,too many hats] all...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

Bill Atkins

unread,
Aug 15, 2010, 11:25:55 AM8/15/10
to Brandon S Allbery KF8NH, haskel...@haskell.org
No, not really.  Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions.

Donn Cave

unread,
Aug 15, 2010, 11:50:23 AM8/15/10
to haskel...@haskell.org
Quoth "Bryan O'Sullivan" <b...@serpentine.com>,
> On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave <do...@avvanta.com> wrote:
...

>> ByteString will continue to be the obvious choice
>> for big data loads.
>
> Don't confuse "I have big data" with "I need bytes". If you are working with
> bytes, use bytestring. If you are working with text, outside of a few narrow
> domains you should use text.

I wonder how many ByteString users are `working with bytes', in the
sense you apparently mean where the bytes are not text characters.
My impression is that in practice, there is a sizeable contingent
out here using ByteString.Char8 and relatively few applications for
the Word8 type. Some of it should no doubt move to Text, but the
ability to work with native packed data - minimal processing and
space requirements, interoperability with foreign code, mmap, etc. -
is attractive enough that the choice can be less than obvious.

Donn Cave

unread,
Aug 15, 2010, 12:06:22 PM8/15/10
to haskel...@haskell.org
Quoth Bill Atkins <wat...@alum.rpi.edu>,

> No, not really. Linked lists are very easy to deal with recursively and
> Strings automatically work with any already-defined list functions.

Yes, they're great - a terrible mistake, for a practical programming
language, but if you fail to recognize the attraction, you miss some of
the historical lesson on emphasizing elegance and correctness over
practical performance.

Donn Cave, do...@avvanta.com

Felipe Lessa

unread,
Aug 15, 2010, 12:06:53 PM8/15/10
to Donn Cave, haskel...@haskell.org
On Sun, Aug 15, 2010 at 12:50 PM, Donn Cave <do...@avvanta.com> wrote:
> I wonder how many ByteString users are `working with bytes', in the
> sense you apparently mean where the bytes are not text characters.
> My impression is that in practice, there is a sizeable contingent
> out here using ByteString.Char8 and relatively few applications for
> the Word8 type.  Some of it should no doubt move to Text, but the
> ability to work with native packed data - minimal processing and
> space requirements, interoperability with foreign code, mmap, etc. -
> is attractive enough that the choice can be less than obvious.

Using ByteString.Char8 doesn't mean your data isn't a stream of bytes,
it means that it is a stream of bytes but for convenience you prefer
using Char8 functions. For example, a DNA sequence (AATCGATACATG...)
is a stream of bytes, but it is better to write 'A' than 65.
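A small sketch of that convenience (the function name here is made up for illustration):

```haskell
import qualified Data.ByteString.Char8 as B

-- The underlying data is bytes, but Char8 lets the bases be
-- written as literal characters rather than Word8 values.
countAdenine :: B.ByteString -> Int
countAdenine = B.count 'A'

main :: IO ()
main = print (countAdenine (B.pack "AATCGATACATG"))
```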

But yes, many users of ByteStrings should be using Text. =)

Cheers!

--
Felipe.

Brandon S Allbery KF8NH

unread,
Aug 15, 2010, 12:33:34 PM8/15/10
to Bill Atkins, haskel...@haskell.org

Except that it seems to me that a number of functions in Data.List are
really functions on Strings and not especially useful on generic lists.
There is overlap but it's not as large as might be thought.


Andrew Coppin

unread,
Aug 15, 2010, 1:09:44 PM8/15/10
to haskel...@haskell.org
Donn Cave wrote:
> I wonder how many ByteString users are `working with bytes', in the
> sense you apparently mean where the bytes are not text characters.
> My impression is that in practice, there is a sizeable contingent
> out here using ByteString.Char8 and relatively few applications for
> the Word8 type. Some of it should no doubt move to Text, but the
> ability to work with native packed data - minimal processing and
> space requirements, interoperability with foreign code, mmap, etc. -
> is attractive enough that the choice can be less than obvious.
>

I use ByteString for various binary-processing stuff. I also use it for
string-processing, but that's mainly because I didn't know anything else
existed. I'm sure lots of other people are using stuff like Data.Binary
to serialise raw binary data using ByteString too.

Andrew Coppin

unread,
Aug 15, 2010, 1:53:39 PM8/15/10
to haskel...@haskell.org
Donn Cave wrote:
> Quoth Bill Atkins <wat...@alum.rpi.edu>,
>
>
>> No, not really. Linked lists are very easy to deal with recursively and
>> Strings automatically work with any already-defined list functions.
>>
>
> Yes, they're great - a terrible mistake, for a practical programming
> language, but if you fail to recognize the attraction, you miss some of
> the historical lesson on emphasizing elegance and correctness over
> practical performance.
>

And if you fail to recognise what a grave mistake placing performance
before correctness is, you end up with things like buffer overflow
exploits, SQL injection attacks, the Y2K bug, programs that can't handle
files larger than 2GB or that don't understand Unicode, and so forth.
All things that could have been almost trivially avoided if everybody
wasn't so hung up on absolute performance at any cost.

Sure, performance is a priority. But it should never be the top
priority. ;-)

Brandon S Allbery KF8NH

unread,
Aug 15, 2010, 2:01:12 PM8/15/10
to haskel...@haskell.org

On 8/15/10 13:53 , Andrew Coppin wrote:
> injection attacks, the Y2K bug, programs that can't handle files larger than
> 2GB or that don't understand Unicode, and so forth. All things that could
> have been almost trivially avoided if everybody wasn't so hung up on
> absolute performance at any cost.

Now that's a bit unfair; nobody imagined back when lseek() was enshrined in
the Unix API that it would still be in use when a (long) wasn't big enough
:) (Remember that Unix is itself a practical example of a research platform
"avoiding success at any cost" gone horribly wrong.)


Bryan O'Sullivan

unread,
Aug 15, 2010, 2:04:01 PM8/15/10
to John Millikin, haskel...@haskell.org
On Sat, Aug 14, 2010 at 6:05 PM, Bryan O'Sullivan <b...@serpentine.com> wrote:
  • If it's not good enough, and the fault lies in a library you chose, report a bug and provide a test case.
As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?

GNU wc -m:
  • en_US.UTF-8: 0.701s
text 0.7.1.0:
  • lazy text: 1.959s
  • strict text: 3.527s
darcs HEAD:
  • lazy text: 0.749s
  • strict text: 0.927s

Andrew Coppin

unread,
Aug 15, 2010, 2:34:49 PM8/15/10
to haskel...@haskell.org
Brandon S Allbery KF8NH wrote:
> (Remember that Unix is itself a practical example of a research platform
> "avoiding success at any cost" gone horribly wrong.)
>

I haven't used Erlang myself, but I've heard it described in a similar
way. (I don't know how true that actually is...)

Daniel Fischer

unread,
Aug 15, 2010, 2:39:24 PM8/15/10
to haskel...@haskell.org
On Sunday 15 August 2010 20:04:01, Bryan O'Sullivan wrote:
> On Sat, Aug 14, 2010 at 6:05 PM, Bryan O'Sullivan
<b...@serpentine.com>wrote:
> > - If it's not good enough, and the fault lies in a library you

> > chose, report a bug and provide a test case.
> >
> As a case in point, I took the string search benchmark that Daniel shared
> on Friday, and boiled it down to a simple test case: how long does it
> take to read a 31MB file?
>
> GNU wc -m:
>
> - en_US.UTF-8: 0.701s
>
> text 0.7.1.0:
>
> - lazy text: 1.959s
> - strict text: 3.527s
>
> darcs HEAD:
>
> - lazy text: 0.749s
> - strict text: 0.927s

That's great. If that performance difference is a show stopper, one
shouldn't go higher-level than C anyway :)
(doesn't mean one should stop thinking about further speed-up, though)

Out of curiosity, what kind of speed-up did your Friday fix bring to the
searching/replacing functions?

Brandon S Allbery KF8NH

unread,
Aug 15, 2010, 2:48:32 PM8/15/10
to haskel...@haskell.org

On 8/15/10 14:34 , Andrew Coppin wrote:

> Brandon S Allbery KF8NH wrote:
>> (Remember that Unix is itself a practical example of a research platform
>> "avoiding success at any cost" gone horribly wrong.)
>
> I haven't used Erlang myself, but I've heard it described in a similar way.
> (I don't know how true that actually is...)

Similar case, actually: internal research project with internal practical
uses, then got discovered and "productized" by a different internal group.


Bryan O'Sullivan

unread,
Aug 15, 2010, 2:53:32 PM8/15/10
to Daniel Fischer, haskel...@haskell.org
On Sun, Aug 15, 2010 at 11:39 AM, Daniel Fischer <daniel.i...@web.de> wrote:
Out of curiosity, what kind of speed-up did your Friday fix bring to the
searching/replacing functions?

Quite a bit!

text 0.7.1.0 and 0.7.2.1:
  • 1.056s
darcs HEAD:
  • 0.158s

Donn Cave

unread,
Aug 15, 2010, 3:01:20 PM8/15/10
to haskel...@haskell.org
Quoth Andrew Coppin <andrew...@btinternet.com>,
...

> And if you fail to recognise what a grave mistake placing performance
> before correctness is, you end up with things like buffer overflow
> exploits, SQL injection attacks, the Y2K bug, programs that can't handle
> files larger than 2GB or that don't understand Unicode, and so forth.
> All things that could have been almost trivially avoided if everybody
> wasn't so hung up on absolute performance at any cost.
>
> Sure, performance is a priority. But it should never be the top
> priority. ;-)

You should never have to choose. Not to belabor the point, but to
dismiss all that as the work of morons who weren't as wise as we are,
is the same mistake from the other side of the wall - performance counts.
If you solve the problem by assigning a priority to one or the other,
you aren't solving the problem.

Donn Cave, do...@avvanta.com

Daniel Fischer

unread,
Aug 15, 2010, 3:12:09 PM8/15/10
to Bryan O'Sullivan, haskel...@haskell.org
On Sunday 15 August 2010 20:53:32, Bryan O'Sullivan wrote:
> On Sun, Aug 15, 2010 at 11:39 AM, Daniel Fischer
>
> <daniel.i...@web.de>wrote:
> > Out of curiosity, what kind of speed-up did your Friday fix bring to
> > the searching/replacing functions?
>
> Quite a bit!
>
> text 0.7.1.0 and 0.7.2.1:
>
> - 1.056s
>
> darcs HEAD:
>
> - 0.158s

Awesome :D

Gregory Collins

unread,
Aug 15, 2010, 3:27:12 PM8/15/10
to Ivan Lazar Miljenovic, haskel...@haskell.org
Ivan Lazar Miljenovic <ivan.mi...@gmail.com> writes:

> Don Stewart <do...@galois.com> writes:
>
>> * Pay attention to Haskell Cafe announcements
>> * Follow the Reddit Haskell news.
>> * Read the quarterly reports on Hackage
>> * Follow Planet Haskell
>
> And yet there are still many packages that fall under the radar with no
> announcements of any kind on initial release or even new versions :(

Subscribe to http://hackage.haskell.org/packages/archive/recent.rss in
your RSS reader: problem solved!

G
--
Gregory Collins <gr...@gregorycollins.net>

wren ng thornton

unread,
Aug 15, 2010, 8:06:22 PM8/15/10
to Bryan O'Sullivan, haskel...@haskell.org
Bryan O'Sullivan wrote:
> As a case in point, I took the string search benchmark that Daniel shared
> on Friday, and boiled it down to a simple test case: how long does it take
> to read a 31MB file?
>
> GNU wc -m:
>
> - en_US.UTF-8: 0.701s
>
> text 0.7.1.0:
>
> - lazy text: 1.959s
> - strict text: 3.527s
>
> darcs HEAD:
>
> - lazy text: 0.749s
> - strict text: 0.927s

When should we expect to see the HEAD stamped and numbered? After some
of the recent benchmark dueling re web frameworks, I know Text got a bad
rap compared to ByteString. It'd be good to stop the FUD early.
Repeating the above in the announcement should help a lot.

--
Live well,
~wren

Don Stewart

unread,
Aug 15, 2010, 8:10:54 PM8/15/10
to wren ng thornton, haskel...@haskell.org
wren:

> Bryan O'Sullivan wrote:
>> As a case in point, I took the string search benchmark that Daniel shared
>> on Friday, and boiled it down to a simple test case: how long does it take
>> to read a 31MB file?
>>
>> GNU wc -m:
>>
>> - en_US.UTF-8: 0.701s
>>
>> text 0.7.1.0:
>>
>> - lazy text: 1.959s
>> - strict text: 3.527s
>>
>> darcs HEAD:
>>
>> - lazy text: 0.749s
>> - strict text: 0.927s
>
> When should we expect to see the HEAD stamped and numbered? After some
> of the recent benchmark dueling re web frameworks, I know Text got a bad
> rap compared to ByteString. It'd be good to stop the FUD early.
> Repeating the above in the announcement should help a lot.

For what its worth, for several bytestring announcements I published
comprehensive function-by-function comparisions of performance on
enormous data sets, until there was unambiguous evidence bytestring was
faster than List.

E.g http://www.mail-archive.com/has...@haskell.org/msg18596.html

Ivan Lazar Miljenovic

unread,
Aug 15, 2010, 11:29:21 PM8/15/10
to Gregory Collins, haskel...@haskell.org
Gregory Collins <gr...@gregorycollins.net> writes:

> Ivan Lazar Miljenovic <ivan.mi...@gmail.com> writes:
>
>> Don Stewart <do...@galois.com> writes:
>>
>>> * Pay attention to Haskell Cafe announcements
>>> * Follow the Reddit Haskell news.
>>> * Read the quarterly reports on Hackage
>>> * Follow Planet Haskell
>>
>> And yet there are still many packages that fall under the radar with no
>> announcements of any kind on initial release or even new versions :(
>
> Subscribe to http://hackage.haskell.org/packages/archive/recent.rss in
> your RSS reader: problem solved!

As I said in reply to someone else: that won't help you get the intent
of a library, how it has changed from previous versions, etc.

Bulat Ziganshin

unread,
Aug 16, 2010, 1:33:21 AM8/16/10
to Bryan O'Sullivan, haskel...@haskell.org
Hello Bryan,

Sunday, August 15, 2010, 10:04:01 PM, you wrote:

> shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?
> GNU wc -m:

there are even slower ways to do it if you need :)

if your data aren't cached, then speed is limited by HDD. if your data
are cached, it should be 20-50x faster. try cat >nul


--
Best regards,
Bulat mailto:Bulat.Z...@gmail.com

Bulat Ziganshin

unread,
Aug 16, 2010, 1:35:44 AM8/16/10
to Daniel Fischer, haskel...@haskell.org
Hello Daniel,

Sunday, August 15, 2010, 10:39:24 PM, you wrote:

> That's great. If that performance difference is a show stopper, one
> shouldn't go higher-level than C anyway :)

*all* speed measurements that find Haskell is as fast as C were
broken. Let's see:

D:\testing>read MsOffice.arc
MsOffice.arc 317mb -- Done
Time 0.407021 seconds (timer accuracy 0.000000 seconds)
Speed 779.505632 mbytes/sec



Daniel Fischer

unread,
Aug 16, 2010, 8:44:33 AM8/16/10
to Bulat Ziganshin, haskel...@haskell.org
Hi Bulat,

On Monday 16 August 2010 07:35:44, Bulat Ziganshin wrote:
> Hello Daniel,
>
> Sunday, August 15, 2010, 10:39:24 PM, you wrote:
> > That's great. If that performance difference is a show stopper, one
> > shouldn't go higher-level than C anyway :)
>
> *all* speed measurements that find Haskell is as fast as C were
> broken.

That's a pretty bold claim, considering that you probably don't know all
such measurements ;)

But let's get serious. Bryan posted measurements showing the text (HEAD)
package's performance within a reasonable factor of wc's. (Okay, he didn't
give a complete description of his test, so we can only assume that all
participants did the same job. I'm bold enough to assume that.)
Lazy text being 7% slower than wc, strict 30%.

If you are claiming that his test was flawed (and since the numbers clearly
showed Haskell slower than C, just not much, I suspect you do, otherwise I
don't see the point of your post), could you please elaborate why you think
it's flawed?

> Let's see:
>
> D:\testing>read MsOffice.arc
> MsOffice.arc 317mb -- Done
> Time 0.407021 seconds (timer accuracy 0.000000 seconds)
> Speed 779.505632 mbytes/sec

I see nothing here, not knowing what `read' is. None of read(n), read(2),
read(1p), read(3p) makes sense here, so it must be something else.
Since it outputs a size in bytes, I doubt that it actually counts
characters, like wc -m and, presumably, the text programmes Bryan
benchmarked.
Just counting bytes, wc and Data.ByteString[.Lazy] can do much faster than
counting characters too.

Benedikt Huber

unread,
Aug 16, 2010, 7:55:32 PM8/16/10
to Daniel Fischer, Bulat Ziganshin, haskel...@haskell.org
On 16.08.10 14:44, Daniel Fischer wrote:
> Hi Bulat,
> On Monday 16 August 2010 07:35:44, Bulat Ziganshin wrote:
>> Hello Daniel,
>>
>> Sunday, August 15, 2010, 10:39:24 PM, you wrote:
>>> That's great. If that performance difference is a show stopper, one
>>> shouldn't go higher-level than C anyway :)
>>
>> *all* speed measurements that find Haskell is as fast as C were
>> broken.
>
> That's a pretty bold claim, considering that you probably don't know all
> such measurements ;)
>
> [...]

> If you are claiming that his test was flawed (and since the numbers clearly
> showed Haskell slower than C, just not much, I suspect you do, otherwise I

> don't see the point of your post), could you please elaborate why you think
> it's flawed?
Hi Daniel,
you are right, the throughput of 'cat' (as proposed by Bulat) is not a
fair comparison, and 'all speed measurements favoring haskell are
broken' is hardly a reasonable argument. However, 'wc -m' is indeed a
rather slow way to count the number of UTF-8 characters. Python, for
example, is quite a bit faster (1.60s vs 0.93s for 70M) on my
machine[1,2]. Despite of all this, I think the performance of the text
package is very promising, and hope it will improve further!

cheers, benedikt

[1] A special purpose C implementation (as the one presented here:
http://canonical.org/~kragen/strlen-utf8.html) is even faster (0.50),
but that's not a fair comparison either.
[2] I do not know Python, so maybe there is an even faster way than
print len(sys.stdin.readline().decode('utf-8'))
