
Which is faster? File I/O or lists

George Brown

Dec 22, 2001, 9:27:19 AM
Which is faster, reading a file into a list, like with:
eval lappend list [split [read $file] \n]
and processing the list, or reading the file a line at a time, with:
gets $file line

I haven't done any benchmark timing, but in an application I'm writing I put in a
mechanism to switch between the two methods and I [much to my surprise] didn't
see an appreciable difference between list processing and file processing. The
file processing is sequential, parsing a variable-length text file for lines
containing consistently formatted strings. I can't "grep" the file because when
I match a string, I need to get subsequent lines.

Any suggestions would be appreciated.

--
George and Cindy Brown e-mail: gncb...@adelphia.net
4 Frederick Drive
New Hartford, NY 13413

miguel sofer

Dec 22, 2001, 10:33:26 AM
George Brown wrote:
>
> Which is faster, reading a file into a list, like with:
> eval lappend list [split [read $file] \n]
> and processing the list, or reading the file a line at a time, with:
> gets $file line
>

It is faster to read it into a list - at least as long as the file isn't
so large that you start swapping. However, the way you are doing it is
quite inefficient: you are generating the list, then appending all its
elements one by one to the string "lappend list", then evaling the
resulting string.

Instead of

eval lappend list [split [read $file] \n]

use directly

set list [split [read $file] \n]

as [split] already returns a list.


Cheers
Miguel Sofer

Cameron Laird

Dec 22, 2001, 12:20:11 PM
.
.
.
Seconded. Big [read]s can be MUCH faster.
--

Cameron Laird <Cam...@Lairds.com>
Business: http://www.Phaseit.net
Personal: http://starbase.neosoft.com/~claird/home.html

Mark G. Saye

Dec 22, 2001, 1:20:54 PM
George Brown wrote:
>
> Which is faster, reading a file into a list, like with:
> eval lappend list [split [read $file] \n]
> and processing the list, or reading the file a line at a time, with:
> gets $file line
>
> I haven't done any benchmark timing, but in an application I'm writing I put in a
> mechanism to switch between the two methods and I [much to my surprise] didn't
> see an appreciable difference between list processing and file processing. The
> file processing is sequential, parsing a variable-length text file for lines
> containing consistently formatted strings. I can't "grep" the file because when
> I match a string, I need to get subsequent lines.

You can get a marginal improvement using the "read channelId numChars"
option to "read", as long as you are not using a multi-byte encoding.

To benchmark a procedure or script, you can use the "time script
?count?" Tcl command, which returns the average amount of time required
to perform 'script' 'count' times.

For example:

# -----------------
proc proc1 {file} {
    if { [catch {open $file r} fd] } {
        return
    } else {
        set list [split [read $fd [file size $file]] \n]
        close $fd
    }
    return $list
}

proc proc2 {file} {
    if { [catch {open $file r} fd] } {
        return
    } else {
        while { [gets $fd line] != -1 } {
            lappend list $line
        }
        close $fd
    }
    return $list
}

# Find a large file for testing
set file "/etc/termcap"
puts "file='$file' size='[file size $file]'"
puts "time proc1='[time {proc1 $file} 10]'"
puts "time proc2='[time {proc2 $file} 10]'"
# -----------------

outputs:

file='/etc/termcap' size='702559'
time proc1='123952 microseconds per iteration'
time proc2='327944 microseconds per iteration'

This is on an AMD K7 at 700 MHz, Linux (Mandrake 8.1), Tcl 8.3.4.

I found using the 'else' construct (as above) also slightly reduces the
time taken.

Mark /

--
Mark G. Saye
mark...@yahoo.com

George Brown

Dec 22, 2001, 1:53:56 PM
Cameron Laird wrote:

> In article <3C24A7C6...@utdt.edu>, miguel sofer <m...@utdt.edu> wrote:
>
>>George Brown wrote:
>>
>>>Which is faster, reading a file into a list, like with:
>>> eval lappend list [split [read $file] \n]
>>>and processing the list, or reading the file a line at a time, with:
>>> gets $file line
>>>
>>>
>>It is faster to read it into a list - at least as long as the file isn't
>>so large that you start swapping. However, the way you are doing it is
>>quite inefficient: you are generating the list, then appending all its
>>elements one by one to the string "lappend list", then evaling the
>>resulting string.
>>
>>Instead of
>>
>> eval lappend list [split [read $file] \n]
>>
>>use directly
>>
>> set list [split [read $file] \n]
>>
>>as [split] already returns a list.
>>
> .
> .
> .
> Seconded. Big [read]s can be MUCH faster.
>


Thanks for the tip on reading the file. It is faster.

The unfortunate thing [for me at least] is the processing does not seem to be
faster when searching through a memory-resident list an element at a time (after
the entire file has been read in) as opposed to reading the file line-by-line. I
was hoping for noticeable performance improvements. Was I hoping for too much or
is there something else I can do in processing the list that will beat the
performance of file I/O?

Steve Offutt

Dec 22, 2001, 2:12:00 PM
"George Brown" <gncb...@adelphia.net> wrote in message
news:3C24D6B7...@adelphia.net...

[snip]

> Thanks for the tip on reading the file. It is faster.
>
> The unfortunate thing [for me at least] is the processing does not seem to be
> faster when searching through a memory-resident list an element at a time (after
> the entire file has been read in) as opposed to reading the file line-by-line. I
> was hoping for noticeable performance improvements. Was I hoping for too much or
> is there something else I can do in processing the list that will beat the
> performance of file I/O?

George,

I suspect that if you were to post your proc for parsing the data, the pros
around here could help you improve its performance.

These guys are the best...

Steve


--
Posted from dial131.sunflower.org [209.16.214.131]
via Mailgate.ORG Server - http://www.Mailgate.ORG

Bryan Oakley

Dec 22, 2001, 2:00:13 PM
> >Instead of
> >
> > eval lappend list [split [read $file] \n]
> >
> >use directly
> >
> > set list [split [read $file] \n]
> >
> >as [split] already returns a list.
> .
> .
> .
> Seconded. Big [read]s can be MUCH faster.

... and don't forget that specifying the buffer size helps a bunch too

set list [split [read $file [file size $filename]] \n]

http://mini.net/cgi-bin/wikit/348.html


Uwe Klein

Dec 23, 2001, 4:36:56 AM
George Brown wrote:
>
> Which is faster, reading a file into a list, like with:
> ..

> containing consistently formatted strings. I can't "grep" the file because when
> I match a string, I need to get subsequent lines.

Hi, not Tcl, but if you can grep, you can grep with "grep -A <number of
interesting lines>" - see man grep:

  All variants of grep understand the following options:

  -A NUM, --after-context=NUM
        Print NUM lines of trailing context after matching lines.
  -B NUM, --before-context=NUM
        Print NUM lines of leading context before matching lines.
  -C [NUM], --context[=NUM]
        Print NUM lines (default 2) of output context.
  -NUM  Same as --context=NUM lines of leading and trailing context.
        However, grep will never print any given line more than once.
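
If you would rather drive grep from Tcl than from a shell, here is a minimal
sketch (the file name and the context count of 3 are made up; exec raises an
error when grep finds no matches, hence the catch):

    # Hypothetical example: each "TRACKER ID:" line plus the 3 lines after it.
    # grep exits non-zero when nothing matches, so the call is wrapped in catch.
    if {[catch {exec grep -A 3 {TRACKER ID:} /var/log/scan.log} output]} {
        set output ""
    }
    foreach line [split $output \n] {
        puts $line    ;# process the match and its trailing context here
    }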

Chances for a "White Christmas" where you live?

G!
UK

--
Uwe Klein [mailto:uwe-...@foni.net]
KLEIN MESSGERAETE Habertwedt 1
D-24376 Groedersby b. Kappeln, GERMANY
phone: +49 4642 920 123 FAX: +49 4642 920 125

Roy Terry

Dec 23, 2001, 9:09:26 AM
I presume you're searching with "lsearch"?
The latest version, Tcl 8.4, can do a binary
search if the list is sorted with [lsort] and
the -sorted flag is passed to lsearch.
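
A minimal sketch of that (8.4 syntax; the variable names are only illustrative,
and it pays off when you search the same list many times without modifying it
in between):

    # Sort once (O(n log n)); each lookup is then a binary search rather
    # than a linear scan of the whole list.
    set sorted [lsort $all_scn_list]
    if {[lsearch -sorted $sorted $tracker] == -1} {
        puts "tracker $tracker not seen before"
    }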

I suggest, as others have, that you post the "slow" code
for suggestions.

Cheers,
Roy

George Brown

Dec 23, 2001, 10:17:48 AM
Steve Offutt wrote:

> "George Brown" <gncb...@adelphia.net> wrote in message
> news:3C24D6B7...@adelphia.net...
>
> [snip]
>
>
>>Thanks for the tip on reading the file. It is faster.
>>
>>The unfortunate thing [for me at least] is the processing does not seem to be
>>faster when searching through a memory-resident list an element at a time (after
>>the entire file has been read in) as opposed to reading the file line-by-line. I
>>was hoping for noticeable performance improvements. Was I hoping for too much or
>>is there something else I can do in processing the list that will beat the
>>performance of file I/O?
>>
>
> George,
>
> I suspect that if you were to post your proc for parsing the data, the
> pros around here could help you improve its performance.
>
> These guys are the best...


I can set read_from_file to read a line at a time or read the entire file into a
list. I then have a get_next proc that returns the next line. Here are some
snippets. Essentially I'm asking why isn't my non-file (ie. list) processing
faster than the read-a-line-at-a-time processing?

proc open_scn_log_file {message} {
    if {!$scn_file_open} {
        if { [catch {open $scn_log_file_name r} scn_file] } {
            tk_messageBox -icon error \
                -message "Could not open log \"$scn_log_file_name\"." \
                -title "Error" -type ok
            set scn_file_eof 1
            return 0
        }
        set scn_file_open 1
    }

    if {!$read_from_file} {
        set mtime [file mtime $scn_log_file_name]
        if {$scn_file_mtime < $mtime} {
            busy start "$message from $scn_log_file_name..."
            seek $scn_file $scn_file_byte start
            set scn_file_contents [split [read $scn_file [file size $scn_log_file_name]] \n]
            set scn_file_byte [tell $scn_file]
        } else {
            busy start "$message from file $scn_log_file_name..."
        }
        set scn_file_mtime $mtime
        close $scn_file
        set scn_file_open 0
    } else {
        busy start "$message from file $scn_log_file_name..."
    }
    return 1
}

proc get_next_scn_line {line} {
    upvar $line myLine
    if {$read_from_file} {
        if {[catch {gets $scn_file myLine}]} {
            set scn_file_eof 1
            return 0
        } elseif {[eof $scn_file]} {
            set scn_file_eof 1
            return 0
        }
    } else {
        if {$scn_file_index >= [llength $scn_file_contents]} {
            set scn_file_eof 1
            set myLine ""
            return 0
        }

        set myLine [lindex $scn_file_contents $scn_file_index]
        incr scn_file_index
    }
    return 1
}

Typical processing might be:
while {[get_next_scn_line line]} {
    if {[regexp {TRACKER ID: (.*)$} $line match tracker]} {
        #debug_info "looking at: $tracker"
        if {[lsearch -regexp $all_scn_list "^$tracker$"] == -1} {
            lappend all_scn_list "$tracker"
        }
        # DO MORE STUFF
    }
}

I also use [string first ...] and [regexp ...] with paren matching to match into
variables for things I'm looking for.

Thanks for any suggestions.

Roy Terry

Dec 23, 2001, 4:52:21 PM
George Brown wrote:

> snippets. Essentially I'm asking why isn't my non-file (ie. list) processing
> faster than the read-a-line-at-a-time processing?

Perhaps I'm missing something, but I think the answer is "buffering".
The file is not really being read line by line at the I/O level, so the
overhead saved by reading everything in advance versus reading as needed
is negligible - which is good.
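
If you want to see how much the buffering itself matters on your data, the
channel buffer size is adjustable - a purely illustrative experiment, not
something the posted code needs ($file is assumed to name a test file):

    foreach size {4096 262144} {
        set fd [open $file r]
        fconfigure $fd -buffersize $size   ;# bytes of channel buffering
        puts "buffer $size: [time {while {[gets $fd line] >= 0} {}}]"
        close $fd
    }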

Is the real question how to make it run faster?
If so, then my comments are:

>Typical processing might be:
> while {[get_next_scn_line line]} {
> if {[regexp {TRACKER ID: (.*)$} $line match tracker]} {
> #debug_info "looking at: $tracker"
> if {[lsearch -regexp $all_scn_list "^$tracker$"] == -1} {
> lappend all_scn_list "$tracker"
> }
> # DO MORE STUFF
> }
> }


1. Looks like the -regexp to lsearch is unnecessary as by
using ^...$ you make it equivalent to an ordinary literal
match (unless you have regexp special chars in $tracker!).

2. Depending on the average length of $tracker you should
try using an array to detect uniqueness:

    if { ! [info exists trkarray($tracker)]} {
        lappend ...
        set trkarray($tracker) 1
    ...
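
Spelled out in the context of the posted loop (just a sketch, reusing George's
proc and variable names plus the hypothetical trkarray):

    while {[get_next_scn_line line]} {
        if {[regexp {TRACKER ID: (.*)$} $line match tracker]} {
            # an array lookup is a hash probe, so the duplicate check stays
            # cheap no matter how many trackers have already been seen
            if {![info exists trkarray($tracker)]} {
                set trkarray($tracker) 1
                lappend all_scn_list $tracker
            }
            # DO MORE STUFF
        }
    }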
HTH,
Roy

Ken Jones

Dec 26, 2001, 12:58:50 AM
Roy Terry <royt...@earthlink.net> wrote in message news:<3C2651D5...@earthlink.net>...

> George Brown wrote:
>
> > snippets. Essentially I'm asking why isn't my non-file (ie. list)
> > processing faster than the read-a-line-at-a-time processing?
>
> Perhaps I'm missing something, but I think the answer is "buffering".
> The file is not really being read line by line at the I/O level, so the
> overhead saved by reading everything in advance versus reading as needed
> is negligible - which is good.

Don't assume that your I/O is your performance bottleneck.

I noticed that in this thread and your previous thread, you didn't
mention how big the files are that you're processing. Common Tcl
wisdom is that "read" is faster than "gets", but when you're wondering
how much of a difference it really makes, try it out!

You can run comparison tests on your system by using the Tcl "time"
command to time how long it takes to execute a chunk of code. See the
"time" reference page and http://mini.net/tcl/wiki/348.html on the Tcl
Wiki for more information.

Here's an example I just ran on my system, an IBM Thinkpad T20 with a
750MHz Pentium III with 192MB RAM running Windows 2000 and Tcl 8.3.2
(yes, I need to upgrade!):

% time {
    set file "tkcon.tcl"
    set fid [open $file r]
    foreach line [split [read $fid [file size $file]] "\n"] {}
    close $fid
} 1000
42622 microseconds per iteration
% time {
    set file "tkcon.tcl"
    set fid [open $file r]
    while {[gets $fid line] >= 0} {}
    close $fid
} 1000
78423 microseconds per iteration
% file size $file
140159

"read" was faster in this case. But if a file is quite small, the
difference is going to be negligable. And if the file is quite large,
"read" might bog down by inducing paging. So you can try similar tests
to get a sense of the "sweet spot" on your particular systems.

> Is the real question how to make it run faster?
> If so, then my comments are:
>
> >Typical processing might be:
> > while {[get_next_scn_line line]} {
> > if {[regexp {TRACKER ID: (.*)$} $line match tracker]} {
> > #debug_info "looking at: $tracker"
> > if {[lsearch -regexp $all_scn_list "^$tracker$"] == -1} {
> > lappend all_scn_list "$tracker"
> > }
> > # DO MORE STUFF
> > }
> > }
>
>
> 1. Looks like the -regexp to lsearch is unnecessary as by
> using ^...$ you make it equivalent to an ordinary literal
> match (unless you have regexp special chars in $tracker!).
>
> 2. Depending on the average length of $tracker you should
> try using an array to detect uniqueness:
>
>     if { ! [info exists trkarray($tracker)]} {
>         lappend ...
>         set trkarray($tracker) 1
>     ...

Good spot by Roy. Although regular expressions are very powerful and
can speed string processing up a lot, they can be slow to compile. So
your "lsearch -regexp" is going to be very expensive if you don't need
the power of regular expressions in your search.

Using my system for some hard data:

% time {
    catch {unset elems}
    set elems {}
    for {set i 1} {$i <= 1000} {incr i} {
        set val [expr {int(rand()*1000)}]
        if {[lsearch -regexp $elems "^$val$"] == -1} {
            lappend elems $val
        }
    }
} 100
1115300 microseconds per iteration
% time {
    catch {unset elems}
    set elems {}
    for {set i 1} {$i <= 1000} {incr i} {
        set val [expr {int(rand()*1000)}]
        if {[lsearch -exact $elems "$val"] == -1} {
            lappend elems $val
        }
    }
} 100
46570 microseconds per iteration

Ouch! Using the "-regexp" vs. "-exact" was almost *25 times* slower
for these 1,000 random values! We can speed this up even more with
Roy's suggestion of using Tcl arrays to filter out duplicates. If the
order of the elements isn't important, we can use "array names" to get
our list of unique elements. Otherwise, we'll still build the list
using "lappend", but use the array to detect duplicates:

% time {
    catch {unset elems}
    catch {unset elemlist}
    for {set i 1} {$i <= 1000} {incr i} {
        set val [expr {int(rand()*1000)}]
        set elems($val) 1
    }
    set elemlist [array names elems]
} 100
10820 microseconds per iteration
% time {
    catch {unset elems}
    catch {unset elemlist}
    for {set i 1} {$i <= 1000} {incr i} {
        set val [expr {int(rand()*1000)}]
        if {![info exists elems($val)]} {
            set elems($val) 1
            lappend elemlist $val
        }
    }
} 100
18620 microseconds per iteration

So, even retaining the original order of input elements, we were able
to make this little chunk of code run in only 1.7% of the original
time (once again, assuming 1,000 elements of random numerical data).
Using other tips from the Tcl Performance page of the Tcl Wiki (that
URL again was http://mini.net/tcl/wiki/348.html), you might be able to
achieve similar optimizations elsewhere in your code.

- Ken Jones, President
Avia Training and Consulting
www.avia-training.com
866-TCL-HELP (866-825-4357) US Toll free
415-643-8692 Voice
415-643-8697 Fax

Peter da Silva

Dec 26, 2001, 11:34:10 AM
In article <5202b141.0112...@posting.google.com>,
Ken Jones <k...@avia-training.com> wrote:
> Good spot by Roy. Although regular expressions are very powerful and
> can speed string processing up a lot, they can be slow to compile. So
> your "lsearch -regexp" is going to be very expensive if you don't need
> the power of regular expressions in your search.

I'm surprised the regexp isn't cached in compiled state.

--
`-_-' In hoc signo hack, Peter da Silva.
'U` "A well-rounded geek should be able to geek about anything."
-- nic...@esperi.org
Disclaimer: WWFD?

Jeffrey Hobbs

Dec 26, 2001, 11:50:26 AM
Peter da Silva wrote:
>
> In article <5202b141.0112...@posting.google.com>,
> Ken Jones <k...@avia-training.com> wrote:
> > Good spot by Roy. Although regular expressions are very powerful and
> > can speed string processing up a lot, they can be slow to compile. So
> > your "lsearch -regexp" is going to be very expensive if you don't need
> > the power of regular expressions in your search.
>
> I'm surprised the regexp isn't cached in compiled state.

They are cached, both in the Tcl objects themselves and as the last 30 static
strings (to catch patterns that aren't placed in objects, but are used in
something like a loop).

--
Jeff Hobbs The Tcl Guy
Senior Developer http://www.ActiveState.com/
Tcl Support and Productivity Solutions

Jeff Hobbs

Dec 26, 2001, 4:59:48 PM
Ken Jones wrote:
> Don't assume that your I/O is your performance bottleneck.
...

> > 1. Looks like the -regexp to lsearch is unnecessary as by
> > using ^...$ you make it equivalent to an ordinary literal
> > match (unless you have regexp special chars in $tracker!).

> for {set i 1} {$i <= 1000} {incr i} {
>     set val [expr {int(rand()*1000)}]
>     if {[lsearch -regexp $elems "^$val$"] == -1} {

versus:

>     if {[lsearch -exact $elems "$val"] == -1} {

...


> Ouch! Using the "-regexp" vs. "-exact" was almost *25 times* slower
> for these 1,000 random values! We can speed this up even more with
> Roy's suggestion of using Tcl arrays to filter out duplicates. If the

Roy and Ken did a good job of breaking this down into a much more
reasonable algorithmic use of Tcl features. Peter followed up wondering
why no caching was done, and given my response that regexps are cached,
you might wonder why the regexp above was still so slow. If you look at
the above code, you will see that a brand new regular expression is
created each time, its text being the value of the new random number
given Tcl's subst rules. The regexp engine doesn't see "^$val$", it sees
e.g. "^956$". That is why regexps were inappropriate in the first place.
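
A minimal illustration of the difference (the data and variable names are
made up):

    # Fixed pattern text: compiled once, then reused from the cache on
    # every call.
    foreach line $lines {
        regexp {TRACKER ID: (.*)$} $line -> tracker
    }

    # Pattern text built from the data: "^956$", "^42$", ... each one is a
    # brand new string, so a fresh regexp gets compiled on (nearly) every call.
    foreach val $values {
        regexp "^$val$" $candidate
    }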

Also, for the example above, in 8.4 you can make that a little bit
faster by using [lsearch -integer -exact ...], but the array method
is still faster. The only reason you would use lsearch is if you
*really* wanted to save the memory overhead of the array.

--
Jeff Hobbs The Tcl Guy
Senior Developer http://www.ActiveState.com/
Tcl Support and Productivity Solutions

http://www.ActiveState.com/Products/ASPN_Tcl/
