Is there an *efficient* way to get a file line count in tcl?


Jeff David

Aug 31, 1999
I need a cross-platform way to get a count of lines in a file (thus eliminating
anything like "exec wc -l"). This proc works, but is excruciatingly slow on
large files:

proc linecount {file} {
    set i 0
    set fid [open $file r]
    # Read one line at a time and count it.
    while {[gets $fid line] > -1} {incr i}
    close $fid
    return $i
}

For a 1,000,000 line file this takes almost a minute to run on an Ultra 1.
wc -l takes 4-5 seconds. Increasing the buffersize using fconfigure also
does not help (nor would I expect it to in this case).

The line lengths are not equal so I cannot consider doing something like
expr {[file size $file] / $linesize}.

Any ideas?

Jeff David
jld...@lucent.com

Bob Techentin

Aug 31, 1999

Jeff,

You are suffering from a small read buffer. It is *much* faster if you
do one of two things:

1. Increase the file buffer size (to, say, a million bytes):

set fid [open $file r]

fconfigure $fid -buffersize 1000000
...

2. Slurp up the whole file in a single, optimized, read,
and use 'split' to separate the data into lines:

set fid [open $file r]

set data [read $fid [file size $file]]
set lineCount [llength [split $data "\n"]]

Bob

--
Bob Techentin techenti...@mayo.edu
Mayo Foundation (507) 284-2702
Rochester MN, 55905 USA http://www.mayo.edu/sppdg/sppdg_home_page.html
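
One detail of the [split] approach is worth checking: when the data ends with a newline, [split] yields one extra, empty list element, so the raw [llength] comes out one higher than the number of newline characters that wc -l would report. A minimal sketch, using an in-memory string as a stand-in for the file contents (the variable names are illustrative only):

```tcl
# Three newline-terminated lines, as a file read into memory would look.
set data "alpha\nbeta\ngamma\n"

# split produces a trailing empty element after the final newline.
set pieces [split $data "\n"]
puts [llength $pieces]            ;# 4 elements: alpha beta gamma {}

# Subtract one to get the number of newline characters.
set lineCount [expr {[llength $pieces] - 1}]
puts $lineCount                   ;# 3
```

Whether to subtract one depends on whether an unterminated final line should count as a line.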

Jeff David

Aug 31, 1999
Bob Techentin wrote:
>
> Jeff David wrote:


> You are suffering from a small read buffer. It is *much* faster if you
> do one of two things:
>
> 1. Increase the file buffer size (to, say, a million bytes):
>
> set fid [open $file r]
> fconfigure $fid -buffersize 1000000
> ...

First, I tried this with the procedure I posted and, as I originally stated,
it did no good whatsoever. The times were identical regardless of the buffer
size. When doing what you suggest below, it is certainly more efficient
to use large buffer sizes. I'm not sure you gain anything if you are using
gets on 30-40 character lines.

>
> 2. Slurp up the whole file in a single, optimized, read,
> and use 'split' to separate the data into lines:
>
> set fid [open $file r]
> set data [read $fid [file size $file]]
> set lineCount [llength [split $data "\n"]]
>

Well, I just tried this using a buffersize of 1,000,000. The time to
compute the number of lines INCREASED to 1 minute and 25 seconds from
about 58 seconds in the original proc.

Any other ideas out there?

Jeff David

Keith Lea

Aug 31, 1999
Jeff David <jldavid@REMOVE_THISlucent.com> wrote in message
news:37CC513E.16C1@REMOVE_THISlucent.com...

> > 2. Slurp up the whole file in a single, optimized, read,
> > and use 'split' to separate the data into lines:
> >
> > set fid [open $file r]
> > set data [read $fid [file size $file]]
> > set lineCount [llength [split $data "\n"]]
> >
> Well, I just tried this using a buffersize of 1,000,000. The time to
> compute the number of lines INCREASED to 1 minute and 25 seconds from
> about 58 seconds in the original proc.
>
> Any other ideas out there?
>
> Jeff David

Try, instead of using [split], doing this

set lineCount [regsub -all \n $data {} data]

-kl
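
The [regsub] trick relies on the fact that, when given a result variable, [regsub -all] returns the number of substitutions it made. A small sketch on an in-memory string; the throwaway variable name junk is an assumption here, chosen so that the original data is left untouched rather than overwritten:

```tcl
set data "alpha\nbeta\ngamma\n"

# With a result variable, regsub returns the number of replacements made.
# The substituted text goes into a throwaway variable, so $data is unchanged.
set lineCount [regsub -all "\n" $data {} junk]
puts $lineCount    ;# 3
```

Unlike the [split] version, this counts newline characters directly, so no off-by-one adjustment is needed.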

Donal K. Fellows

Sep 8, 1999
In article <37CC43DC.11FC@REMOVE_THISlucent.com>,

Jeff David <jldavid@REMOVE_THISlucent.com> wrote:
> For a 1,000,000 line file this takes almost a minute to run on an Ultra 1.
> wc -l takes 4-5 seconds. Increasing the buffersize using fconfigure also
> does not help (nor would I expect it to in this case).
>
> The line lengths are not equal so I cannot consider doing something like
> expr {[file size $file] / $linesize}.

Hmm. A million lines is a lot of data, and so we want to try to avoid
copying it as much as we can (since copying is *very*slow* when done
loads of times.) Working in large chunks is good too (since syscalls
are slow too.)

proc linecount {file {eofchar "\n"}} {
    set i 0
    set fid [open $file]
    # Use a 512K buffer.
    fconfigure $fid -buffersize 524288 -translation binary
    while {![eof $fid]} {
        incr i [regsub -all $eofchar [read $fid 524288] $eofchar junk]
    }
    close $fid
    return $i
}

Note that when the input data is coming from a Mac, you will need to
pass eofchar as "\r" instead of "\n", since I disable the normal
conversion stuff for speed. The above code seems to be substantially
faster than your version, though not quite as fast as [exec wc -l].
On a handy 4MB/100kLine file (with a very wide mix of line lengths),
my timings (Sun Ultra 5/Solaris 2.7/Tcl 8.0.4 with data coming over
NFS) were:

% time {linecount $testfile}
3883065 microseconds per iteration
% time {linecount2 $testfile}
565478 microseconds per iteration
% time {exec wc -l $testfile}
472118 microseconds per iteration

I don't know about speeds with later versions, since I've not upgraded
yet...

The main tunable parameter is how much data to slurp in at once. That
is probably best set to something that is a multiple of the filesystem
chunk size (e.g. the size of a cluster on FAT systems) and 512KB fits
the bill fairly well while not taking too much memory. Larger sizes
are potentially faster (and slurping in the whole lot in one call is
ideal) but you've probably got too much data for that to be a serious
proposition.

Donal.
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk
-- The small advantage of not having California being part of my country would
be overweighed by having California as a heavily-armed rabid weasel on our
borders. -- David Parsons <o r c @ p e l l . p o r t l a n d . o r . u s>
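
To experiment with the read size discussed above, the chunk size can be made a parameter. The following is only a sketch under stated assumptions: the proc name linecountChunked and the parameter names chunkSize and eolchar are hypothetical, and it counts with [regexp -all], which needs Tcl 8.3 or later (newer than the releases timed in this thread), in place of the posted [regsub -all].

```tcl
# Hypothetical variant of the chunked reader; chunkSize and eolchar are
# illustrative parameter names, not part of the posted proc.
proc linecountChunked {file {chunkSize 524288} {eolchar "\n"}} {
    set count 0
    set fid [open $file r]
    # No end-of-line or encoding translation: raw bytes only.
    fconfigure $fid -translation binary
    while {![eof $fid]} {
        # regexp -all (Tcl 8.3 or later) returns the number of matches.
        incr count [regexp -all -- $eolchar [read $fid $chunkSize]]
    }
    close $fid
    return $count
}

# Build a small demo file and count it with two different chunk sizes.
set demoFile [file join [pwd] linecount_demo.txt]
set out [open $demoFile w]
fconfigure $out -translation binary
puts -nonewline $out "one\ntwo\nthree\n"
close $out

set defaultResult [linecountChunked $demoFile]
set smallResult   [linecountChunked $demoFile 4]
puts $defaultResult    ;# 3
puts $smallResult      ;# 3, the chunk size does not change the count
file delete $demoFile
```

Because only single characters are counted, a chunk boundary can never split a match, which is what makes reading in fixed-size pieces safe here.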

Chang LI

Sep 8, 1999
Donal K. Fellows wrote:
>
> In article <37CC43DC.11FC@REMOVE_THISlucent.com>,
> Jeff David <jldavid@REMOVE_THISlucent.com> wrote:
> > For a 1,000,000 line file this takes almost a minute to run on an Ultra 1.
> > wc -l takes 4-5 seconds. Increasing the buffersize using fconfigure also
> > does not help (nor would I expect it to in this case).
> >
> > The line lengths are not equal so I cannot consider doing something like
> > expr {[file size $file] / $linesize}.
>
> Hmm. A million lines is a lot of data, and so we want to try to avoid
> copying it as much as we can (since copying is *very*slow* when done
> loads of times.) Working in large chunks is good too (since syscalls
> are slow too.)
>

Is it faster to just compare a byte rather than using regsub?

> proc linecount {file {eofchar "\n"}} {
> set i 0
> set fid [open $file]
> # Use a 512K buffer.
> fconfigure $fid -buffersize 524288 -translation binary
> while {![eof $fid]} {
> incr i [regsub -all $eofchar [read $fid 524288] $eofchar junk]
> }
> close $fid
> return $i
> }
>

>


--
--------------------------------------------------------------
Chang LI, Neatware
email: cha...@neatware.com
web: http://www.neatware.com
--------------------------------------------------------------

Heribert Dahms

Sep 9, 1999
In <7r5aju$8uo$1...@m1.cs.man.ac.uk> fell...@cs.man.ac.uk writes:

: Note that when the input data is coming from a Mac, you will need to
: pass eofchar as "\r" instead of "\n", since I disable the normal
: conversion stuff for speed.

Some nitpicking: That should be eolchar instead of eofchar!


Bye, Heribert (da...@ifk20.mach.uni-karlsruhe.de)

donal_...@my-deja.com

Sep 9, 1999
In article <7r6o6e$s7f$1...@news.rz.uni-karlsruhe.de>,
DA...@ifk20.mach.uni-karlsruhe.de (Heribert Dahms) wrote:
> Some nitpicking: That should be eolchar instead of eofchar!

Fixed in the Wiki version.

Donal.



Donal K. Fellows

Sep 9, 1999
In article <37D6A3...@neatware.com>, Chang LI <cha...@neatware.com> wrote:
> Is it faster to just compare a byte rather than using regsub?

Do you know a faster way (than [regsub -all], that is) to count the
instances of a particular character in a string? Note that [string
first] is definitely not faster since it means you have to implement
the loop over the buffer in Tcl (which is done in C with the regsub
version...)
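
For comparison, this is roughly what a [string first] counter has to look like. It is a sketch only: the proc name countWithStringFirst is hypothetical, and it assumes a Tcl release whose [string first] accepts a start index. Each match costs one pass through the interpreter's loop, whereas the [regsub -all] call scans the whole buffer inside a single command.

```tcl
# Hypothetical helper: count occurrences of $char by looping in Tcl.
proc countWithStringFirst {char data} {
    set count 0
    set pos 0
    # Each call to string first is one trip through the Tcl-level loop.
    while {[set pos [string first $char $data $pos]] >= 0} {
        incr count
        incr pos
    }
    return $count
}

set data "alpha\nbeta\ngamma\n"
set loopCount   [countWithStringFirst "\n" $data]
set regsubCount [regsub -all "\n" $data {} junk]
puts $loopCount      ;# 3
puts $regsubCount    ;# 3, with the loop running inside regsub
```

Both give the same count; the difference is only in where the loop over the buffer runs.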

Jeff David

Sep 9, 1999
Donal K. Fellows wrote:
>
> Hmm. A million lines is a lot of data, and so we want to try to avoid
> copying it as much as we can (since copying is *very*slow* when done
> loads of times.) Working in large chunks is good too (since syscalls
> are slow too.)
>
> proc linecount {file {eofchar "\n"}} {
> set i 0
> set fid [open $file]
> # Use a 512K buffer.
> fconfigure $fid -buffersize 524288 -translation binary
> while {![eof $fid]} {
> incr i [regsub -all $eofchar [read $fid 524288] $eofchar junk]
> }
> close $fid
> return $i
> }
>
Cute. However, on my Sparc Ultra 1 this is no faster than my original
proc (but much faster than the other two solutions posted).

Here is what came up when I tested (linecount is my original proc,
linecount4 is the above proc) on a 1,000,000 line 34 MB file:

% time {linecount4 bigfile} 1
40008220 microseconds per iteration

% time {linecount bigfile} 1
39764359 microseconds per iteration

Both of these are very poor (by an order of magnitude) compared to doing
a straight "wc -l" (which I cannot do because of cross-platform requirements):

% time {catch {exec wc -l bigfile} lines} 1
3945380 microseconds per iteration

Thanks for trying.

Jeff David

Peter Spjuth

Sep 9, 1999
"Donal K. Fellows" wrote:
> Do you know a faster way (than [regsub -all], that is) to count the
> instances of a particular character in a string?

That depends on what Tcl version you are using.
According to my measurements, regsub -all is fastest on 8.0.
With 8.1, and even more so with 8.2, regsub gets slow and
split+llength is faster.

Tests on Solaris, with a 100k lines file:

The first line is the regsub version, slightly optimised (having
an empty subSpec seems to be slightly faster).
The second is a split+llength version.
Both are attached below.

8.0p2
471877 microseconds per iteration
612888 microseconds per iteration
8.1.1
2519146 microseconds per iteration
1298369 microseconds per iteration
8.2.0
4536487 microseconds per iteration
1394101 microseconds per iteration

proc linecount {file {eofchar "\n"}} {
    set i 0
    set fid [open $file]
    # Use a 512K buffer.
    fconfigure $fid -buffersize 524288 -translation binary
    while {![eof $fid]} {
        incr i [regsub -all $eofchar [read $fid 524288] "" junk]
    }
    close $fid
    return $i
}

proc linecount2 {file {eofchar "\n"}} {
    set i 0
    set fid [open $file]
    # Use a 512K buffer.
    fconfigure $fid -buffersize 524288 -translation binary
    while {![eof $fid]} {
        incr i [expr {[llength [split [read $fid 524288] $eofchar]] - 1}]
    }
    close $fid
    return $i
}

/Peter

laurent....@cgi.ca

Sep 9, 1999
to comp.l...@list.deja.com
On 9 Sep, Jeff David wrote:
> Cute. However, on my Sparc Ultra 1 this is no faster than my original
> proc (but much faster than the other two solutions posted).
>
> Here is what came up when I tested (linecount is my original proc,
> linecount4 is the above proc) on a 1,000,000 line 34 MB file:
>
> % time {linecount4 bigfile} 1
> 40008220 microseconds per iteration
>
> % time {linecount bigfile} 1
> 39764359 microseconds per iteration
>
> Both of these are very poor (by an order of magnitude) compared to doing
> a straight "wc -l" (which I cannot do because of cross-platform requirements):
>
> % time {catch {exec wc -l bigfile} lines} 1
> 3945380 microseconds per iteration
>
> Thanks for trying.
>

Hmmmm... That's weird. Donal's test showed that his code was much faster
than your original code, but a little slower than wc -l. Just a thought: are
you using Tcl 8.1? I know that the first 8.1 release was *really* slow for
string manipulation. Try 8.0.5 or 8.2.

L

--
Penguin Power! Nothing I say reflects the views of my employer

Laurent Duperval mailto:laurent....@cgi.ca
CGI - FWFM Project Phone: (514) 391-9523

Jeff David

Sep 9, 1999
laurent....@cgi.ca wrote:
>
> Hmmmm... That's weird. Donal's test showed that his code was much faster
> than your original code, but a little slower than wc -l. Just a thought: are
> you using Tcl 8.1? I know that the first 8.1 release was *really* slow for
> string manipulation. Try 8.0.5 or 8.2.

I'm running 8.0.5 on a Sun Ultra 1 running Solaris 2.6. I don't understand
why I'm getting such radically different results either.

Jeff David

Chang LI

Sep 9, 1999
Peter Spjuth wrote:
>

I am happy with your result for 8.2.0: a threefold speedup!



> 8.0p2
> 471877 microseconds per iteration
> 612888 microseconds per iteration
> 8.1.1
> 2519146 microseconds per iteration
> 1298369 microseconds per iteration
> 8.2.0
> 4536487 microseconds per iteration
> 1394101 microseconds per iteration
>

> proc linecount2 {file {eofchar "\n"}} {


> set i 0
> set fid [open $file]
> # Use a 512K buffer.
> fconfigure $fid -buffersize 524288 -translation binary
> while {![eof $fid]} {

> # incr i [expr {[llength [split [read $fid 524288] $eofchar]] - 1}]
# it may be a little bit faster with

incr i [llength [split [read $fid 524288] $eofchar]]
incr i -1

> }
> close $fid
> return $i
> }
>
> /Peter

--

Donal K. Fellows

Sep 10, 1999
In article <37D7BA69.223F@REMOVE_THISlucent.com>,

Jeff David <jldavid@REMOVE_THISlucent.com> wrote:
> Cute. However, on my Sparc Ultra 1 this is no faster than my original
> proc (but much faster than the other two solutions posted).
>
> Here is what came up when I tested (linecount is my original proc,
> linecount4 is the above proc) on a 1,000,000 line 34 MB file:
>
> % time {linecount4 bigfile} 1
> 40008220 microseconds per iteration
> % time {linecount bigfile} 1
> 39764359 microseconds per iteration
>
> Both of these are very poor (by an order of magnitude) compared to
> doing a straight "wc -l" (which I cannot do because of
> cross-platform requirements):
>
> % time {catch {exec wc -l bigfile} lines} 1
> 3945380 microseconds per iteration

Odd. When I scaled up my example to use a larger dataset (59404376
bytes, 1447534 lines, so of the same sort of scale as your example) on
my Ultra5, I got:

% time {exec wc -l data}
4044177 microseconds per iteration
% time {linecount data}
6584521 microseconds per iteration

Are you making the computer thrash? (A thrashing system isn't going
to beat anything other than a crashing system!) If so, you'll need to
turn down the read-size parameter (you can still get quite reasonable
behaviour with a read-size of 10k, which only adds about 5-10% to the
execution time...)

The [llength [split [read]]] version proposed elsewhere on this thread
takes about 50% longer than [regsub] with Tcl8.0.4 and doing it in 8.1
at all is unfair. However I'm surprised that 8.2 can't do fast
reading from a binary channel (unless that is, people are forgetting
to turn off the conversion of local encodings into UTF8/Unicode... :^)
