consuming file contents in blocks

Francesco Bochicchio

unread,

Jan 11, 2009, 7:33:22 AM1/11/09

to

Hi list,

I'm trying to learn some haskell by giving myself little tasks.
One such tasks (borrowed from another newsgroup) requires to read a huge
file (2GB) and process it in chunks of 2KB.
I was trying to do it using a lazy non-imperative style,
and rather than reading one block at the time and process it, I wrote the
following code to build a list of blocks and process it. But it seems that
this code stil causes all file contents to be loaded in memory, because
the program dies for an out of memory error (after running for about 10
minutes).

I was under the impression that analyzeContents would drive the consuming
of the file contents one block each time, and therefore only that block
would have been loaded in memory by haskell.
Clearly there is something I don't understand going on here - any hint
would be much appreciated.

Ciao
---
FB

--
-- Code begin
--

processBlock [] = 0 -- TBW
processBlock block = head block -- TBW

getBlocks [] = []
getBlocks bytes = block : getBlocks rest
where
bs = 2048
(block, rest) = splitAt bs bytes

analyzeContents bytes = [ processBlock block | block <- (getBlocks bytes) ]

main = do
arglist <- System.Environment.getArgs
let fname = head arglist
bytes <- readFile fname
let pixels = analyzeContents bytes
-- TBW
print "Done."

Ertugrul Söylemez

unread,

Jan 11, 2009, 3:37:53 PM1/11/09

to

Francesco Bochicchio <boc...@virgilio.it> wrote:

> I'm trying to learn some haskell by giving myself little tasks. One
> such tasks (borrowed from another newsgroup) requires to read a huge
> file (2GB) and process it in chunks of 2KB.
>

> [...]

Besides lots of little beginner's typical style glitches, I don't see
any problem with your code. In fact, with a few changes to make it
compile, it works well here. I'm using GHC 6.10.1 on Linux, but I think
that it should work equally well with every other GHC. Hugs failed,
because I used a binary file as input, suggesting to use binary I/O.

Your idea of lazy I/O was correct and your implementation makes sense.
However, don't expect this code to really read the file in 2 KiB blocks,
because IO code uses internal buffering. In fact, you may get a much
simpler and more concise solution by not using lazy I/O, but instead
strict binary I/O. Strict I/O is almost always faster anyway, and it
allows you to handle read errors properly.

> processBlock [] = 0

> processBlock block = head block

It's always better to add type signatures. They make your code much
clearer. Next thing is, Char is not an instance of Num, so you can't
use numbers right away. Some compilers may allow that, GHC doesn't.
Further, instead of using fully qualified names, you generally wish to
import the corresponding module. Last but not least, for completeness
you may want to give your module a name explicitly.

module Main where

import Data.Char
import System.Environment

processBlock :: String -> Char
processBlock [] = chr 0

processBlock block = head block

> getBlocks [] = []
> getBlocks bytes = block : getBlocks rest
> where
> bs = 2048
> (block, rest) = splitAt bs bytes

There is nothing wrong with this code other than that personally I
wouldn't be as verbose as giving 2048 a name, instead I would make this
number a parameter. Of course, that's a matter of taste. And BTW, I'd
rename the function to blocks, because the 'get' prefix is usually used
with functions retrieving state in a monad.

blocks :: Int -> [a] -> [[a]]
blocks size [] = []
blocks size bytes = block : blocks rest
where
(block, rest) = splitAt size bytes

> analyzeContents bytes = [ processBlock block | block <- (getBlocks bytes) ]

You may have heard of the sin to overuse sugar. It makes your code fat
and in bad cases incomprehensible. Better use higher order functions
here, because as soon as you've got used to them, they express much more
clearly what's going on.

analyzeContents :: String -> String
analyzeContents = map processBlock . blocks 2048

> main = do
> arglist <- System.Environment.getArgs
> let fname = head arglist
> bytes <- readFile fname
> let pixels = analyzeContents bytes
> -- TBW
> print "Done."

I suppose you haven't read a lot about monads yet, so I'll limit my
critique to the substance. Firstly you can use pattern matching in
do-notation, so there is really no need to use the 'head' function.
Secondly, it's better not to introduce names, where none are really
necessary. If you would like to tell the programmer to interpret the
result of 'analyzeContents' as pixel data, better do that in a comment.

Finally, there is nothing wrong with using 'print', but it prints string
representations of values, which may include syntax like quotes, while
putStrLn simply prints a string (and a line feed).

main :: IO ()
main = do
(fname:_) <- getArgs
bytes <- readFile fname
putStrLn (analyzeContents bytes)
putStrLn "Done."

Greets,
Ertugrul.

--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/

Francesco Bochicchio

unread,

Jan 12, 2009, 1:41:03 PM1/12/09

to

On Sun, 11 Jan 2009 21:37:53 +0100, Ertugrul Söylemez wrote:

> Francesco Bochicchio <boc...@virgilio.it> wrote:
>
>> I'm trying to learn some haskell by giving myself little tasks. One
>> such tasks (borrowed from another newsgroup) requires to read a huge
>> file (2GB) and process it in chunks of 2KB.
>>
>> [...]
>
> Besides lots of little beginner's typical style glitches, I don't see
> any problem with your code. In fact, with a few changes to make it
> compile, it works well here. I'm using GHC 6.10.1 on Linux, but I think
> that it should work equally well with every other GHC. Hugs failed,
> because I used a binary file as input, suggesting to use binary I/O.
>

[...]

Thanks for all the suggestions: it was only a quick test (less quick that
I planned to, since I can't get to work :-( ) but I guess is never too
soon to learn good habits, in case I ever do real work with haskell.

My original program (mapfile.hs), of which I posted a sligtly simplified
(and incorrect) version, works fine with small and medium-sized files,
but choked on the 2GB file after then minutes, with the following message
(I use gch 6.6 on OS/X):

mapfile: memory allocation failed (requested 2097152 bytes)

This lead me to the idea that maybe I was not doing 'lazy I/O'
correctly ' and the program tried to read all the file in memory
before processing it. Now, I'm not sure of it anymore, and I will look
at the parts of the program that I omitted (which basically is the output
side, which should write a much smaller file of about 10MB ).

Your version of my simplified code has another problem with the 2GB file:

Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.

I also tried a strict IO version of this test, using ByteString.hGet
to read each time a chunk of file and process it before reading the
next one, and it fails with the same error on the 2GB file.

In another language, I would think this is a case of too many recursion
levels. But in haskell, which uses recursion as replacemet for iteration?
Does it mean tha there is a maximum number of recursion levels and
therefore of 'loop cycles' ?

Thanks again for all your suggestion. I tend to like my programs a bit
more sugary than you do, but is matter of taste, right? :-)

>
>
> Greets,
> Ertugrul.

Ciao
---
FB

Ertugrul Söylemez

unread,

Jan 12, 2009, 11:44:19 PM1/12/09

to

Francesco Bochicchio <boc...@virgilio.it> wrote:

> My original program (mapfile.hs), of which I posted a sligtly
> simplified (and incorrect) version, works fine with small and
> medium-sized files, but choked on the 2GB file after then minutes,
> with the following message (I use gch 6.6 on OS/X):
>
> mapfile: memory allocation failed (requested 2097152 bytes)
>
> This lead me to the idea that maybe I was not doing 'lazy I/O'
> correctly ' and the program tried to read all the file in memory
> before processing it. Now, I'm not sure of it anymore, and I will look
> at the parts of the program that I omitted (which basically is the
> output side, which should write a much smaller file of about 10MB ).

Perhaps make sure you're compiling with optimizations enabled (the -O or
-O2 flags). I don't think this is the cause, but try it. If you get
out of memory, it's probably related to strictness.

> Your version of my simplified code has another problem with the 2GB
> file:
>
> Stack space overflow: current size 8388608 bytes.
> Use `+RTS -Ksize' to increase it.
>
> I also tried a strict IO version of this test, using ByteString.hGet
> to read each time a chunk of file and process it before reading the
> next one, and it fails with the same error on the 2GB file.
>
> In another language, I would think this is a case of too many
> recursion levels. But in haskell, which uses recursion as replacemet
> for iteration? Does it mean tha there is a maximum number of
> recursion levels and therefore of 'loop cycles' ?

Firstly no, if you use recursion properly, there is no limit. In
non-monadic functions, make sure that the recursive call is the last
operation. This makes the function tail-recursive and thus equivalent
to a loop. GHC detects this and removes the recursive calls
accordingly. The usual factorial function is not tail-recursive,
because after the recursion, a multiplication is carried out:

factorial 0 = 1
factorial n = n * factorial (n-1)

You can make it tail-recursive by adding an accumulation parameter:

factorial' p 0 = p
factorial' p n = factorial (p*n) (n-1)

factorial = factorial' 1

In monadic code, make sure that the recursive call is the final
computation, i.e. rightmost in the binding sequence.

However, explicit recursion is usually worked around. There are plenty
of ways to do that, most notably monad functions from Control.Monad and
lists.

> Thanks again for all your suggestion. I tend to like my programs a bit
> more sugary than you do, but is matter of taste, right? :-)

As long as it's readable, yes. ;)

Ertugrul Söylemez

unread,

Jan 13, 2009, 10:47:02 AM1/13/09

to

Ertugrul Söylemez <e...@ertes.de> wrote:

> You can make it tail-recursive by adding an accumulation parameter:
>
> factorial' p 0 = p
> factorial' p n = factorial (p*n) (n-1)
>
> factorial = factorial' 1

I just found that there is a little, but significant typo. In the
second line, replace "factorial" by "factorial'", i.e. with a single
quote appended.

Francesco Bochicchio

unread,

Jan 14, 2009, 2:00:16 PM1/14/09

to

On Sun, 11 Jan 2009 13:33:22 +0100, Francesco Bochicchio wrote:

>
> Hi list,
>
> I'm trying to learn some haskell by giving myself little tasks.
> One such tasks (borrowed from another newsgroup) requires to read a huge
> file (2GB) and process it in chunks of 2KB.
> I was trying to do it using a lazy non-imperative style,
> and rather than reading one block at the time and process it, I wrote the
> following code to build a list of blocks and process it. But it seems that
> this code stil causes all file contents to be loaded in memory, because
> the program dies for an out of memory error (after running for about 10
> minutes).
>

<...>

I finally got my program to work with 2GB files, by making tail-recursive
functions and using largely ByteStrings both for processing (lazily) the
input file and to accumulate the resulting bytes. On my PC ( iMac Intel,
Tiger, ghc 6.6 ) it takes about 3 minutes. I guess it could be made
faster, but surely not by me :-)

As a side effect of this, I learned a few gotcha about haskell, which
could be useful if I ever do serious programming in this language. For
now, I might toy a little more with it, but w/out hurring ...

Ciao
---
FB

Francesco Bochicchio

unread,

Jan 14, 2009, 2:45:35 PM1/14/09

to

On Wed, 14 Jan 2009 20:00:16 +0100, Francesco Bochicchio wrote:

> On Sun, 11 Jan 2009 13:33:22 +0100, Francesco Bochicchio wrote:
>
>
> <...>
>
> I finally got my program to work with 2GB files, by making tail-recursive
> functions and using largely ByteStrings both for processing (lazily) the
> input file and to accumulate the resulting bytes. On my PC ( iMac Intel,
> Tiger, ghc 6.6 ) it takes about 3 minutes. I guess it could be made
> faster, but surely not by me :-)
>

As last act of this monologue, I just want to add that the strict IO
version of the same program, done in almost-imperative fashion ( a
tail-recursive IO action with does read block; process block; write
results; in sequence ) takes only 42 seconds, of which only 6 of User Time.

So, sometime lazy is really bad, also for functional programming ...

Ciao
-----
FB

Ertugrul Söylemez

unread,

Jan 14, 2009, 7:06:29 PM1/14/09

to

Francesco Bochicchio <boc...@virgilio.it> wrote:

> As last act of this monologue, I just want to add that the strict IO
> version of the same program, done in almost-imperative fashion ( a
> tail-recursive IO action with does read block; process block; write
> results; in sequence ) takes only 42 seconds, of which only 6 of User
> Time.
>
> So, sometime lazy is really bad, also for functional programming ...

Sure, laziness is useful, but not the ultimate solution, and if you read
block by block anyway, there is no reason to do this lazily.

Dirk Thierbach

unread,

Jan 15, 2009, 4:31:43 AM1/15/09

to

Francesco Bochicchio <boc...@virgilio.it> wrote:
> My original program (mapfile.hs), of which I posted a sligtly simplified
> (and incorrect) version, works fine with small and medium-sized files,
> but choked on the 2GB file after then minutes, with the following message
> (I use gch 6.6 on OS/X):
>
> mapfile: memory allocation failed (requested 2097152 bytes)

Could you check if this also happens if the file size is slightly less
than 2GB? If it doesn't happen in this case, and happens as soon as the
size gets greater than 2GB, then this smells suspicously like a
signed 32bit value used somewhere (maybe temporarily) to hold the
file position. That may be related to the OS/X implementation, it's
soemtimes not so easy to get integer sizes right accross different
unix systems.

If that really is the reason, you may have hit a bug in the library;
in that case, you'll probably have to contact the GHC developers.
Though it may be a good idea to try the same program with the most
recent GHC version first.

- Dirk

Dirk Thierbach

unread,

Jan 15, 2009, 4:42:02 AM1/15/09

to

It's hard to say what impacts the performance without seeing the full
program; sometimes being too lazy or too strict in some auxialiary
function affects the performance greatly. Optimizing by the compiler
may also affect the performance to a large degree, so if you really
want to investigate which style is "best" (as far as one can judge
this from toy examples which don't do any real processing), you
should look at the intermediate code generated by the compiler after
optimization. And then there is cache usage, which usually favours
imperative programs that use the same memory all the time.

Unfortunately, optimizing Haskell programs is a bit of a black art,
and Haskell's strengths are certainly not in the area of fast processing
of large amounts of data (though ByteStrings really help), but more
in the area of implementing complicated algorithms correctly (and
quickly).

In general, saying "lazy is bad" is definitely overgeneralization;
sometimes being too strict can also be bad (search the Haskell mailing
list for some examples).

- Dirk

Francesco Bochicchio

unread,

Jan 16, 2009, 2:29:42 PM1/16/09

to

On Thu, 15 Jan 2009 10:42:02 +0100, Dirk Thierbach wrote:

>
> Unfortunately, optimizing Haskell programs is a bit of a black art,
> and Haskell's strengths are certainly not in the area of fast processing
> of large amounts of data (though ByteStrings really help), but more
> in the area of implementing complicated algorithms correctly (and
> quickly).
>

Optimizing haskell programs might be a black art but to me is still less
obscure than monads :-)

I have been able to get a pair of good hints on how make haskell programs
faster and/or less memory hungry:

- I understand why to use tail recursion, and actually it is relatively
easy to use the accumulation tecnique to make a tail-recursive function
out of a simply recursive one.

- After some reading on haskell wiky, I also understand thunks and why
delaying too much the evaluation may lead to memory exhaustion (I don't
get why in haskell-ese it is often called memory leak: it is not that
memory is wasted or lost: it is just that the program need to much of
it).

I just whish that in the various haskell tutorials I consulted, those
hints were more prominently placed ... although this is partially my
fault since I use to spot-read documentation and tutorials and want to
try coding as soon as possible.

> In general, saying "lazy is bad" is definitely overgeneralization;
> sometimes being too strict can also be bad (search the Haskell mailing
> list for some examples).
>

I don't think "lazy is bad". On the contrary, I used to think that lazy
evaulation was the 'natural way of being' for pure functional languages.
Now I sort of realize that compilers are not (yet?) up to the task to
convert your 'functional dreams' into efficient programs, and you have to
help with hints and directions sprinkled in your code.

> - Dirk

Ciao
---
FB

grimey

unread,

Jan 16, 2009, 3:08:21 PM1/16/09

to

On Jan 16, 12:29 pm, Francesco Bochicchio <bock...@virgilio.it> wrote:
> On Thu, 15 Jan 2009 10:42:02 +0100, Dirk Thierbach wrote:

>
> - After some reading on haskell wiky, I also understand thunks and why
> delaying too much the evaluation may lead to memory exhaustion (I don't
> get why in haskell-ese it is often called memory leak: it is not that
> memory is wasted or lost: it is just that the program need to much of
> it).
>

Yeah, me too! I think it should be called memory hoarding.
-Chip

Dirk Thierbach

unread,

Jan 17, 2009, 8:08:12 AM1/17/09

to

Francesco Bochicchio <boc...@virgilio.it> wrote:
> Optimizing haskell programs might be a black art but to me is still less
> obscure than monads :-)
>
> I have been able to get a pair of good hints on how make haskell programs
> faster and/or less memory hungry:

Yes, tail recursion and avoiding a built-up of thunks by lazyness are
some of the obvious ones, but it just starts there.

> (I don't get why in haskell-ese it is often called memory leak: it
> is not that memory is wasted or lost: it is just that the program
> need to much of it).

I completely agree that it is a misnomer, but for some reason people
insist on using that expression.

> I just whish that in the various haskell tutorials I consulted, those
> hints were more prominently placed ...

Maybe give the authors a hint :-)

> I don't think "lazy is bad". On the contrary, I used to think that lazy
> evaulation was the 'natural way of being' for pure functional languages.
> Now I sort of realize that compilers are not (yet?) up to the task to
> convert your 'functional dreams' into efficient programs, and you have to
> help with hints and directions sprinkled in your code.

Yep.

- Dirk