Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[Caml-list] best and fastest way to read lines from a file?

189 views
Skip to first unread message

YC

unread,
Oct 1, 2007, 5:28:24 PM10/1/07
to caml...@yquem.inria.fr
Hi all -

Newbie question: I'm wondering what's the most efficient way to read in a
file line by line? I wrote a routine in both python and ocaml to read in a
file with 345K lines to do line count and was surprised that python's code
run roughly 3x faster.

I thought the speed should be equivalent and/or somewhat in ocaml favor,
given this is an IO-bound comparison, but perhaps Python's simplistic for
loop have a read-ahead buffer built-in, and perhaps ocaml's input channel is
unbuffered, but I'm not sure how to write a buffered code that can do a line
by line read-in.

Any insight is appreciated, thanks ;)

yc

Python code:
# test.py
#!/usr/bin/python

file = <345k-line.txt>
count = 0
for line in open (file, "r"):
count = count + 1
print "Done: ", count

OCaml code:
(* test.ml *)
let rec line_count filename =
let f = open_in filename in
let rec loop file count =
try
ignore (input_line file);
loop file (count + 1)
with
End_of_file -> count
in
loop f 0;;

let count = line_count <345k-line.txt> in
Printf.printf "Done: %d" count;;

Test
$ time ./test.py
Done: 345001

real 0m0.416s
user 0m0.101s
sys 0m0.247s

$ ocamlopt -o test test.ml
$ time ./test
Done: 345001
real 0m1.483s
user 0m0.631s
sys 0m0.685s

ash

unread,
Oct 1, 2007, 5:53:11 PM10/1/07
to
Try this:

(* test.ml *)
let lines fname =
let cin = open_in fname in
let f _ =
try Some (input_line cin)
with End_of_file -> None
in
Stream.from f

let count = ref 0
let strm = lines "your-input-file.txt"
let _ = Stream.iter (fun _ -> incr count) strm
let _ = print_endline ("Done: " ^ (string_of_int !count))

On my machine (using some other file I had with 637114 lines)...
$ time ./test.py
Done: 637114

real 0m0.509s
user 0m0.412s
sys 0m0.096s

$ ocamlopt test.ml
$ time ./a.out
Done: 637114

real 0m0.305s
user 0m0.236s
sys 0m0.067s

> _______________________________________________
> Caml-list mailing list. Subscription management:http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives:http://caml.inria.fr
> Beginner's list:http://groups.yahoo.com/group/ocaml_beginners
> Bug reports:http://caml.inria.fr/bin/caml-bugs


Daniel Bünzli

unread,
Oct 1, 2007, 5:55:46 PM10/1/07
to caml-list List

Le 1 oct. 07 à 23:27, YC a écrit :

> OCaml code:
> (* test.ml *)
> let rec line_count filename =
> let f = open_in filename in
> let rec loop file count =
> try
> ignore (input_line file);
> loop file (count + 1)
> with
> End_of_file -> count
> in
> loop f 0;;

Your function is not tail recursive. A function is tail-recursive if
there is nothing to do after the recursive call. You might believe
this is the case in your function, but it is not the case because of
the exception handler. Try this instead :

let rec line_count filename =

let ic = open_in filename in
let rec loop ic count =
let line = try Some (input_line ic) with End_of_file -> None in
match line with
| Some _ -> loop ic (count+1)
| None -> count
in
loop ic 0

It should run faster.

Daniel

Olivier Roussel

unread,
Oct 1, 2007, 5:56:50 PM10/1/07
to YC, caml...@yquem.inria.fr
YC a écrit :
> Hi all -
Hi!
> OCaml code:
> (* test.ml <http://test.ml> *)

> let rec line_count filename =
> let f = open_in filename in
> let rec loop file count =
> try
> ignore (input_line file);
> loop file (count + 1)
> with
> End_of_file -> count
> in
> loop f 0;;
>
> let count = line_count <345k-line.txt> in
> Printf.printf "Done: %d" count;;
The following solution is ~2.5x faster than the Python implementation on
my computer. Because there is no more exceptions in recursive calls, and
thanks to tail-recursion.

let readline f =
try Some (input_line f)
with End_of_file -> None;;

let line_count filename =


let f = open_in filename in

let rec loop count =
match (readline f) with
| Some(_) -> loop (count+1)
| None -> count in
loop 0;;

let count = line_count <345k-line.txt> in

Printf.printf "Done: %d\n" count;;


--
Olivier Roussel

YC

unread,
Oct 1, 2007, 6:29:50 PM10/1/07
to Daniel Bünzli, caml-list List
Thanks to everyone replied - your answers are extremely helpful.

> Your function is not tail recursive. A function is tail-recursive if
> there is nothing to do after the recursive call. You might believe
> this is the case in your function, but it is not the case because of
> the exception handler. Try this instead :

I did wonder whether my function was tail recursive but thought maybe it
is. Thanks for the correction from everyone on this point. And the pattern
of returning Some or None makes sense to me now. Rerun the test makes me
feel better ;)

Thanks,
yc

Mattias Engdegård

unread,
Oct 2, 2007, 8:40:54 AM10/2/07
to olivier...@crans.org, caml...@yquem.inria.fr
>let line_count filename =
> let f = open_in filename in
> let rec loop count =
> match (readline f) with
> | Some(_) -> loop (count+1)
> | None -> count in
> loop 0;;

Something like this should be even faster:

exception Done of int

let line_count file =


let rec loop count =

let _ =
try
input_line f
with End_of_file -> raise (Done count)
in
loop (count + 1)
in
try loop 0 with Done x -> x

as it avoids unnecessary consing in the inner loop.

Brian Hurt

unread,
Oct 2, 2007, 8:57:29 AM10/2/07
to caml...@yquem.inria.fr
Mattias Engdegård wrote:

>>let line_count filename =
>> let f = open_in filename in
>> let rec loop count =
>> match (readline f) with
>> | Some(_) -> loop (count+1)
>> | None -> count in
>> loop 0;;
>>
>>
>
>Something like this should be even faster:
>
>exception Done of int
>
>let line_count file =
> let rec loop count =
> let _ =
> try
> input_line f
> with End_of_file -> raise (Done count)
> in
> loop (count + 1)
> in
> try loop 0 with Done x -> x
>
>as it avoids unnecessary consing in the inner loop.
>
>__
>

This should be a FAQ.

let line_count file =
let rec loop count =

let test =
try
let _ = input_line f in
true
with
| Not_found -> false
in
if test then
loop (count + 1)
else
count
in
loop 0
;;

No consing, no unnecessary try/catch, tail recursive.

Brian

kirillkh

unread,
Oct 2, 2007, 12:16:54 PM10/2/07
to Brian Hurt, caml...@yquem.inria.fr
> This should be a FAQ.
>

Since we're talking of 10+ lines of code and only one case among many
possible (you might also want to do something fairly similar, but not quite
the same, as iterating over all words or characters in a file, doing
something else than counting, etc.), I would rather see it implemented in a
library as combinator. What I have in mind is a function that goes over a
file and invokes some user code on each block of
bytes/characters/lines/words/... The points of customization would be:
* how to detect the start and end of block
* routine to pass the blocks to

Then, on top of this combinator, build block-specific ones: for byte, char,
line, word blocks. Also make it possible to customize buffering behavior.

Being new to OCaml, I'm interested in comments - is what I suggest a good
idea? If yes, why hasn't anyone implemented it yet? One could argue that
it's trivial and can be implemented in each use case anew, but among 5
different solutions posted so far, each has its own problems, besides the
supposedly ideal 5-th -- I'd take this as an indication that, although
simple, it's not really trivial to write this thing.

skaller

unread,
Oct 2, 2007, 3:36:56 PM10/2/07
to kirillkh, caml...@yquem.inria.fr
On Tue, 2007-10-02 at 20:02 +0200, kirillkh wrote:
> Replying to a private mail from Brian:

> (* I couldn't figure out, how to declare a polymorphic exception
> properly *)
> exception Done of 'a

That's easy -- you can't: even if you could, how could
you possibly use it?

This compiles fine:

type t = { field : 'a. 'a }
exception Done of t

but 'field' is useless. This is not at all the same as

let f (x:'a) (g:'a -> int) =
match g x with
| 0 -> ..
| ..

because *inside* the function, 'a is not a type variable,
and the code is not polymorphic, it is simply a sole
unknown type, sometimes said to be monomorphised.

The problem with exceptions is that they're not captured,
so they cannot be polymorphic. Exceptions SUCK because
their context is not delimited -- you can throw all the way
out of the mainline .. :)

[This happens to me regularly and it can takes days to figure
out what is Not_found ..]

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

Olivier Andrieu

unread,
Oct 2, 2007, 4:24:37 PM10/2/07
to kirillkh, caml...@yquem.inria.fr
On 10/2/07, kirillkh <kiri...@gmail.com> wrote:
> OK, so I'll give up the parsing/buffering part and only leave efficient
> exception handling. This should leave the user free to do anything with it,
> but prevent performance pitfalls. The following is based on Mattias
> Engdegard's code:

>
> (* I couldn't figure out, how to declare a polymorphic exception properly *)
> exception Done of 'a
>
> let fold_file (file: in_channel)
> (read_func: in_channel->'a)
> (elem_func: 'a->'b->'b)
> (seed: 'b) =
> let rec loop prev_val =
> let input =
> try read_func file
> with End_of_file -> raise (Done prev_val)
> in
> let combined_val = elem_func input prev_val in
> loop combined_val
> in
> try loop seed with Done x -> x
>
> And the usage for line counting:

>
> let line_count filename =
> let f = open_in filename in
> let counter _ count = count + 1 in
> fold_file f readline counter 0
>
> Since it's library code, we can tolerate the little annoyance of the second
> try-catch. As far as I can tell, this code has the same performance
> characteristics as yours: no consing + tail recursion. Any other problems
> with it?

well apart from the fact that you cannot have "polymorphic exceptions"
in OCaml, this kind of code is IMHO much more natural with an
imperative loop instead of a functional one:


let fold_file read chan f init =
let acc = ref init in
begin
try while true do
let d = read chan in
acc := f d !acc
done
with End_of_file -> ()
end ;
!acc

--
Olivier

kirillkh

unread,
Oct 2, 2007, 5:05:39 PM10/2/07
to skaller, caml...@yquem.inria.fr
2007/10/2, skaller <ska...@users.sourceforge.net>:

>
> On Tue, 2007-10-02 at 20:02 +0200, kirillkh wrote:
> > Replying to a private mail from Brian:
>
> > (* I couldn't figure out, how to declare a polymorphic exception
> > properly *)
> > exception Done of 'a
>
> That's easy -- you can't: even if you could, how could
> you possibly use it?
>
> This compiles fine:
>
> type t = { field : 'a. 'a }
> exception Done of t
>
> but 'field' is useless. This is not at all the same as
>
> let f (x:'a) (g:'a -> int) =
> match g x with
> | 0 -> ..
> | ..
>
> because *inside* the function, 'a is not a type variable,
> and the code is not polymorphic, it is simply a sole
> unknown type, sometimes said to be monomorphised.
>
> The problem with exceptions is that they're not captured,
> so they cannot be polymorphic. Exceptions SUCK because
> their context is not delimited -- you can throw all the way
> out of the mainline .. :)
>
> [This happens to me regularly and it can takes days to figure
> out what is Not_found ..]


Is there a way to instantiate an exception with a value of unspecified type
and then do explicit casting on catch?

Is it a deficiency in the language? I suppose OCaml could support
polymorphic exceptions, if they were checked, like in Java, and appeared in
function signatures in a similar way to parameters and return values.

David Allsopp

unread,
Oct 2, 2007, 5:17:14 PM10/2/07
to kirillkh, caml...@yquem.inria.fr
> A little weird to see such inherently functional construct as fold
implemented imperatively. But it's fine with me, as long as it does the job.
I
> wonder, though, how would the performance of a line counter based on your
code compare with the one suggested by Brian.

It's faster, though only slightly - I was going to post an imperative
solution earlier, too. Even in Ocaml, a recursive call (tail or otherwise)
has some costs not present in a simple loop. However, I don't agree that
it's more natural - "ugly" is more the word I'd use!


David

Jon Harrop

unread,
Oct 2, 2007, 5:17:58 PM10/2/07
to caml...@yquem.inria.fr
On Tuesday 02 October 2007 22:05:07 kirillkh wrote:
> Is it a deficiency in the language?

A constraint of the type system by design.

> I suppose OCaml could support
> polymorphic exceptions, if they were checked, like in Java, and appeared in
> function signatures in a similar way to parameters and return values.

The exn type would need an undefined number of type parameters.

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e

Jon Harrop

unread,
Oct 2, 2007, 5:20:23 PM10/2/07
to caml...@yquem.inria.fr
On Tuesday 02 October 2007 21:49:41 kirillkh wrote:
> A little weird to see such inherently functional construct as fold
> implemented imperatively.

On the contrary, this is the core functionality of ML that makes it so
practically useful: combining functional and imperative programming neatly
and efficiently.

Look at the Array and Hashtbl modules of the stdlib for more examples of code
that is both imperative and functional.

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e

_______________________________________________

skaller

unread,
Oct 2, 2007, 6:24:08 PM10/2/07
to David Allsopp, caml...@yquem.inria.fr
On Tue, 2007-10-02 at 22:15 +0100, David Allsopp wrote:
> > A little weird to see such inherently functional construct as fold
> implemented imperatively. But it's fine with me, as long as it does the job.
> I
> > wonder, though, how would the performance of a line counter based on your
> code compare with the one suggested by Brian.
>
> It's faster, though only slightly - I was going to post an imperative
> solution earlier, too. Even in Ocaml, a recursive call (tail or otherwise)
> has some costs not present in a simple loop. However, I don't agree that
> it's more natural - "ugly" is more the word I'd use!

In Felix, tail-rec calls are generally faster than loops.
The reason is that the compiler is better able to reason
about the semantics and so it can do better optimisations.
OTOH loops degenerate to conditional goto-spaghetti and
would require SSA analysis to recover.

For example, given:

fun f(x,y) ... f (a,b)

the implementation generates

var x,y;
set_initial_values(x,y);
start:
...
paropt x,y = a,b;
goto start;

where 'paropt' serialises the parallel assignment
using an optimisation algorithm which tries to minimise
the number of assignments and temporaries used.

this has a significant impact on performance .. Felix version
of Ackermann's function is more then 2.5x faster than Ocamlopt,
and almost 2x faster than C. Ackermann has 2 tail-rec calls.

Trying to do this with loops is much harder because, unlike
the tail-rec formulation, the 'inputs' and 'outputs' connected
by the loops are not explicit, as they are when a function
is used and parameters and arguments define the input/output
relationship (and everything else local is a temporary with no
persistence: Felix doesn't not allow functions to have side
effects, so only local data is mutable).

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________

0 new messages