proc linecount {file} {
    set i 0
    set fid [open $file r]
    while {[gets $fid line] > -1} {incr i}
    close $fid
    return $i
}
For a 1,000,000 line file this takes almost a minute to run on an Ultra 1.
wc -l takes 4-5 seconds. Increasing the buffersize using fconfigure also
does not help (nor would I expect it to in this case).
The line lengths are not uniform, so I cannot just do something like
expr {[file size $file] / $linesize}.
Any ideas?
Jeff David
jld...@lucent.com
Jeff,
You are suffering from a small read buffer. It is *much* faster if you
do one of two things:
1. Increase the file buffer size (to, say, a million bytes):
set fid [open $file r]
fconfigure $fid -buffersize 1000000
...
2. Slurp up the whole file in a single, optimized read,
and use 'split' to separate the data into lines:
set fid [open $file r]
set data [read $fid [file size $file]]
set lineCount [llength [split $data "\n"]]
Bob
--
Bob Techentin techenti...@mayo.edu
Mayo Foundation (507) 284-2702
Rochester MN, 55905 USA http://www.mayo.edu/sppdg/sppdg_home_page.html
> You are suffering from a small read buffer. It is *much* faster if you
> do one of two things:
>
> 1. Increase the file buffer size (to, say, a million bytes):
>
> set fid [open $file r]
> fconfigure $fid -buffersize 1000000
> ...
First, I tried this with the procedure I posted and, as I originally
stated, it did no good whatsoever. The times were identical regardless
of the buffer size. When doing what you suggest below, it is certainly
more efficient to use large buffer sizes. I'm not sure you gain anything
if you are using gets on 30-40 character lines.
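For reference, what I timed was essentially my original proc with the
channel buffer bumped up; something like this (the name linecount_bigbuf
is just for illustration):

proc linecount_bigbuf {file} {
    # Same as the original proc, but with a large channel buffer.
    set i 0
    set fid [open $file r]
    fconfigure $fid -buffersize 1000000
    while {[gets $fid line] > -1} {incr i}
    close $fid
    return $i
}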
>
> 2. Slurp up the whole file in a single, optimized, read,
> and use 'split' to separate the data into lines:
>
> set fid [open $file r]
> set data [read $fid [file size $file]]
> set lineCount [llength [split $data "\n"]]
>
Well, I just tried this using a buffersize of 1,000,000. The time to
compute the number of lines INCREASED to 1 minute and 25 seconds from
about 58 seconds in the original proc.
Any other ideas out there?
Jeff David
Try this instead of using [split]:
set lineCount [regsub -all \n $data {} data]
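This works because, with -all, [regsub] returns the number of
substitutions it performed. For example:

set data "one\ntwo\nthree\n"
set lineCount [regsub -all \n $data {} data]    ;# lineCount is now 3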
-kl
Hmm. A million lines is a lot of data, and so we want to try to avoid
copying it as much as we can (since copying is *very* slow when done
loads of times). Working in large chunks is good too (since syscalls
are slow too).
proc linecount {file {eofchar "\n"}} {
    set i 0
    set fid [open $file]
    # Use a 512K buffer.
    fconfigure $fid -buffersize 524288 -translation binary
    while {![eof $fid]} {
        incr i [regsub -all $eofchar [read $fid 524288] $eofchar junk]
    }
    close $fid
    return $i
}
Note that when the input data is coming from a Mac, you will need to
pass eofchar as "\r" instead of "\n", since I disable the normal
conversion stuff for speed. The above code seems to be substantially
faster than your version, though not quite as fast as [exec wc -l].
On a handy 4MB/100kLine file (with a very wide mix of line lengths),
my timings (Sun Ultra 5/Solaris 2.7/Tcl 8.0.4 with data coming over
NFS) were:
% time {linecount $testfile}
3883065 microseconds per iteration
% time {linecount2 $testfile}
565478 microseconds per iteration
% time {exec wc -l $testfile}
472118 microseconds per iteration
I don't know about speeds with later versions, since I've not upgraded
yet...
The main tunable parameter is how much data to slurp in at once. That
is probably best set to something that is a multiple of the filesystem
chunk size (e.g. the size of a cluster on FAT systems) and 512KB fits
the bill fairly well while not taking too much memory. Larger sizes
are potentially faster (and slurping in the whole lot in one call is
ideal) but you've probably got too much data for that to be a serious
proposition.
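If you want to experiment with that, a sketch along these lines (untested;
linecount3 and the chunk argument are just for illustration) lets you tune
the read size from the call site:

proc linecount3 {file {eofchar "\n"} {chunk 524288}} {
    # Illustrative variant: the chunk size is a parameter so you can try
    # different read sizes (e.g. 10240 on a memory-starved machine).
    set i 0
    set fid [open $file]
    fconfigure $fid -buffersize $chunk -translation binary
    while {![eof $fid]} {
        incr i [regsub -all $eofchar [read $fid $chunk] {} junk]
    }
    close $fid
    return $i
}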
Donal.
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk
-- The small advantage of not having California being part of my country would
be overweighed by having California as a heavily-armed rabid weasel on our
borders. -- David Parsons <o r c @ p e l l . p o r t l a n d . o r . u s>
Would it be faster to just compare bytes rather than using regsub?
> proc linecount {file {eofchar "\n"}} {
>     set i 0
>     set fid [open $file]
>     # Use a 512K buffer.
>     fconfigure $fid -buffersize 524288 -translation binary
>     while {![eof $fid]} {
>         incr i [regsub -all $eofchar [read $fid 524288] $eofchar junk]
>     }
>     close $fid
>     return $i
> }
>
>
> Donal.
--
--------------------------------------------------------------
Chang LI, Neatware
email: cha...@neatware.com
web: http://www.neatware.com
--------------------------------------------------------------
: Note that when the input data is coming from a Mac, you will need to
: pass eofchar as "\r" instead of "\n", since I disable the normal
: conversion stuff for speed.
Some nitpicking: That should be eolchar instead of eofchar!
Bye, Heribert (da...@ifk20.mach.uni-karlsruhe.de)
Fixed in the Wiki version.
Donal.
Do you know a faster way (than [regsub -all], that is) to count the
instances of a particular character in a string? Note that [string
first] is definitely not faster since it means you have to implement
the loop over the buffer in Tcl (which is done in C with the regsub
version...)
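To make that concrete, the kind of pure-Tcl scan you would have to write
looks roughly like this (countEolByScan is hypothetical, shown only to
illustrate why it loses):

proc countEolByScan {data {eolchar "\n"}} {
    # Visit every character and compare it against the end-of-line
    # character.  The loop runs in the Tcl interpreter, so on a megabyte
    # buffer this is far slower than one [regsub -all] call, which does
    # its looping in C.
    set n 0
    set len [string length $data]
    for {set j 0} {$j < $len} {incr j} {
        if {[string compare [string index $data $j] $eolchar] == 0} {
            incr n
        }
    }
    return $n
}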
Here is what came up when I tested (linecount is my original proc,
linecount4 is the above proc) on a 1,000,000 line 34 MB file:
% time {linecount4 bigfile} 1
40008220 microseconds per iteration
% time {linecount bigfile} 1
39764359 microseconds per iteration
Both of these are very poor (by an order of magnitude) compared to doing
a straight "wc -l" (which I cannot do because of cross-platform requirements):
% time {catch {exec wc -l bigfile} lines} 1
3945380 microseconds per iteration
Thanks for trying.
Jeff David
That depends on what Tcl version you are using.
According to my measurements, regsub -all is fastest on 8.0.
With 8.1, and even more so with 8.2, regsub gets slow and
split+llength is faster.
Tests on Solaris, with a 100k lines file:
The first line is the regsub version, slightly optimised (having
an empty subSpec seems to be slightly faster).
The second is a split+llength version.
Both are attached below.
8.0p2
471877 microseconds per iteration
612888 microseconds per iteration
8.1.1
2519146 microseconds per iteration
1298369 microseconds per iteration
8.2.0
4536487 microseconds per iteration
1394101 microseconds per iteration
proc linecount {file {eofchar "\n"}} {
    set i 0
    set fid [open $file]
    # Use a 512K buffer.
    fconfigure $fid -buffersize 524288 -translation binary
    while {![eof $fid]} {
        incr i [regsub -all $eofchar [read $fid 524288] "" junk]
    }
    close $fid
    return $i
}
proc linecount2 {file {eofchar "\n"}} {
    set i 0
    set fid [open $file]
    # Use a 512K buffer.
    fconfigure $fid -buffersize 524288 -translation binary
    while {![eof $fid]} {
        incr i [expr {[llength [split [read $fid 524288] $eofchar]] - 1}]
    }
    close $fid
    return $i
}
/Peter
Hmmmm... That's weird. Donal's test showed that his code was much faster
than your original code, but a little slower than wc -l. Just a thought: are
you using Tcl 8.1? I know that the first 8.1 release was *really* slow for
string manipulation. Try 8.0.5 or 8.2.
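If you're not sure which version you have, [info patchlevel] will tell you,
e.g.:

% info patchlevel
8.0.5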
L
--
Penguin Power! Nothing I say reflects the views of my employer
Laurent Duperval mailto:laurent....@cgi.ca
CGI - FWFM Project Phone: (514) 391-9523
I am happy with your result for 8.2.0: a threefold speedup!
> 8.0p2
> 471877 microseconds per iteration
> 612888 microseconds per iteration
> 8.1.1
> 2519146 microseconds per iteration
> 1298369 microseconds per iteration
> 8.2.0
> 4536487 microseconds per iteration
> 1394101 microseconds per iteration
>
> proc linecount2 {file {eofchar "\n"}} {
>     set i 0
>     set fid [open $file]
>     # Use a 512K buffer.
>     fconfigure $fid -buffersize 524288 -translation binary
>     while {![eof $fid]} {
>         # incr i [expr {[llength [split [read $fid 524288] $eofchar]] - 1}]
          # it may be a little bit faster with
          incr i [llength [split [read $fid 524288] $eofchar]]
          incr i -1
>     }
>     close $fid
>     return $i
> }
>
> /Peter
--
Odd. When I scaled up my example to use a larger dataset (59404376
bytes, 1447534 lines, so of the same sort of scale as your example) on
my Ultra5, I got:
% time {exec wc -l data}
4044177 microseconds per iteration
% time {linecount data}
6584521 microseconds per iteration
Are you making the computer thrash? (A thrashing system isn't going
to beat anything other than a crashing system!) If so, you'll need to
turn down the read-size parameter (you can still get quite reasonable
behaviour with a read-size of 10k, which only adds about 5-10% to the
execution time...)
The [llength [split [read]]] version proposed elsewhere on this thread
takes about 50% longer than [regsub] with Tcl 8.0.4, and doing it in 8.1
at all is unfair. However, I'm surprised that 8.2 can't do fast
reading from a binary channel (unless, that is, people are forgetting
to turn off the conversion of local encodings into UTF8/Unicode... :^)
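For the record, turning that conversion off on 8.1/8.2 just means making
the channel fully binary; something along these lines should do it:

# Treat the data as raw bytes: no encoding conversion and no end-of-line
# translation.  -translation binary already implies a binary encoding on
# 8.1+, but being explicit about both does no harm.
set fid [open $file]
fconfigure $fid -translation binary -encoding binary -buffersize 524288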