Fastest way to shove data from file to tcl-array


Jonas Beskow

May 26, 1999
Greetings,

We have a large dictionary (3MB, 130K lines) that is to be read into an array
(one line per element). When read line by line, the whole thing takes about 70
seconds (on a 200MHz laptop). I also tried using the zipchan extension to read
a zipped version of the dictionary (840 KB) directly. This shaved off 10
seconds.
Now, if it would speed things up, it would be OK to do a little pre-processing
of the dictionary file, i.e. it could be formatted and stored as tcl-code,
something like:

array set a {
key1 val1
key2 val2
...
}

or
set a(key1) val1
set a(key2) val2
...

or something completely different. Also, would it be significantly faster to
write a C-extension that just reads lines from a file and builds up the array
(using Tcl_SetVar2, effectively bypassing the interpreter), or is this overkill?

If there is no (significantly) faster way of doing it, then it's fine. I just
want to be sure we're doing the optimal thing. We're using Tcl 8.0.2 on Windows.

Thanks for any input
- Jonas

--
Jonas Beskow
Perceptual Science Laboratory || Centre for Speech Technology
University of Calif, Santa Cruz || KTH
bes...@fuzzy.ucsc.edu || bes...@speech.kth.se

Chang LI

May 26, 1999
Jonas Beskow wrote:
>
> Greetings,
>

Reading line by line is not optimal. It is better to read a block at a time,
such as 16K or 64K. Tcl's file APIs are not rich.

> We have a large dictionary (3MB, 130K lines), that is to be read into an array
> (one line per element) When read line by line, the whole thing takes about 70
> seconds (on a 200MHz laptop). I also tried using the zipchan extension, to read
> a zipped version of the dictionary (840 KB) directly. This shaved off 10
> seconds.
> Now, if it would speed things up, it would be OK to do a little pre-processing
> of the dictionary file, i.e. it could be formatted and stored as tcl-code,
> something like:
>
> array set a {
> key1 val1
> key2 val2
> ...
> }
>
> or
> set a(key1) val1
> set a(key2) val2
> ...
>

It should be much faster.



> or something completely different. Also, would it be significantly faster to
> write a C-extension that just reads lines from a file and builds up the array
> (using Tcl_SetVar2, effectively bypassing the interpreter) or is this overkill?
>
> If there is no (significantly) faster way of doing it, then it's fine. I just
> want to be sure we're doing the optimal thing. We're using tcl 8.02 on Windows
>
> Thanks for any input
> - Jonas
>
> --
> Jonas Beskow
> Perceptual Science Laboratory || Centre for Speech Technology
> University of Calif, Santa Cruz || KTH
> bes...@fuzzy.ucsc.edu || bes...@speech.kth.se

--
--------------------------------------------------------------
Chang LI, Neatware
email: cha...@neatware.com
web: http://www.neatware.com
--------------------------------------------------------------

Bryan Oakley

May 26, 1999
Jonas Beskow <bes...@fuzzy.ucsc.edu> wrote in message
news:374C4F92...@fuzzy.ucsc.edu...
> Greetings,

>
> We have a large dictionary (3MB, 130K lines) that is to be read into an
> array (one line per element). When read line by line, the whole thing takes
> about 70 seconds (on a 200MHz laptop). I also tried using the zipchan
> extension, to read a zipped version of the dictionary (840 KB) directly.
> This shaved off 10 seconds.

It's likely your bottleneck is in the reading of the file line by line. Try
using the read command, perhaps followed by split (using the newline as the
separator).
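
For example (a rough sketch; it assumes each line is a well-formed list
with the key as its first word and the rest as the value):

set f [open dict.txt r]
set data [read $f]
close $f
foreach line [split $data \n] {
    # first word is the key, the remainder is the value
    set a([lindex $line 0]) [lrange $line 1 end]
}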

> Now, if it would speed things up, it would be OK to do a little
> pre-processing of the dictionary file, i.e. it could be formatted and
> stored as tcl-code,
> something like:
>
> array set a {
> key1 val1
> key2 val2
> ...
> }
>
> or
> set a(key1) val1
> set a(key2) val2
> ...

That is often a very fine idea. The best way to see the performance
improvement (or not) is to try it and see. I'm guessing it will make a very
big difference.

>
> or something completely different. Also, would it be significantly faster
> to write a C-extension that just reads lines from a file and builds up the
> array (using Tcl_SetVar2, effectively bypassing the interpreter) or is this
> overkill?

My guess is that it is overkill. You will probably see better results slurping
the data in really big blocks using read, and possibly formatting the data to
be in a tcl-friendly format.

>
> If there is no (significantly) faster way of doing it, then it's fine. I
> just want to be sure we're doing the optimal thing. We're using tcl 8.02
> on Windows.

In the world of performance, especially when tcl is concerned, it's often
fastest and easiest to simply try out your ideas. Tcl is wonderful for such
tasks, since it's so easy to write and modify scripts.

Best of luck. I bet you can get it down to 10 seconds or so.

FWIW, I just wrote a test script that reads in 1300000 lines of text
(weighing in at 3.7 megabytes) and assigns each line to an element in an
array. Using [read] and [split] was more than twice as fast as using gets.
So it could be that making that simple change will cut your time in half (at
the expense of having to allocate enough memory to temporarily hold two
copies of the data).
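
Roughly what I ran, in case you want to reproduce it (a sketch; test.txt
is a placeholder):

set t1 [time {
    set f [open test.txt r]
    set i 0
    while {[gets $f line] >= 0} { set a1([incr i]) $line }
    close $f
} 1]
puts "gets:       $t1"

set t2 [time {
    set f [open test.txt r]
    set i 0
    foreach line [split [read $f] \n] { set a2([incr i]) $line }
    close $f
} 1]
puts "read/split: $t2"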

Dan Gunter

May 26, 1999
Jonas Beskow wrote:
>
> Greetings,
>
> We have a large dictionary (3MB, 130K lines), that is to be read into an array
> (one line per element) When read line by line, the whole thing takes about 70
> seconds (on a 200MHz laptop). I also tried using the zipchan extension, to read
> a zipped version of the dictionary (840 KB) directly. This shaved off 10
> seconds.
> Now, if it would speed things up, it would be OK to do a little pre-processing
> of the dictionary file, i.e. it could be formatted and stored as tcl-code,
> something like:
>
> array set a {
> key1 val1
> key2 val2
> ...
> }
>
> or
> set a(key1) val1
> set a(key2) val2
> ...
>
> or something completely different. Also, would it be significantly faster to
> write a C-extension that just reads lines from a file and builds up the array
> (using Tcl_SetVar2, effectively bypassing the interpreter) or is this overkill?

I think that this is worth a try, especially if you use Tcl_ObjSetVar2()
instead. C I/O is much, much faster (as you may already know) than Tcl's.

You might also ask yourself whether it is more efficient to store the
entire dictionary entry in the "value", or maybe just the file offset
and length in bytes? This depends on how often you would need to
actually use the entry.
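
A sketch of that idea (untested; dict.txt and the proc name are
placeholders), using [tell] to record where each entry starts:

set f [open dict.txt r]
set off [tell $f]
while {[gets $f line] >= 0} {
    # remember where the entry lives instead of the entry itself
    set idx([lindex $line 0]) [list $off [string length $line]]
    set off [tell $f]
}

# later, pull an entry in on demand ($f must still be open)
proc fetch {f key} {
    global idx
    foreach {off len} $idx($key) break
    seek $f $off
    return [read $f $len]
}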

>
> If there is no (significantly) faster way of doing it, then it's fine. I just
> want to be sure we're doing the optimal thing. We're using tcl 8.02 on Windows
>

> Thanks for any input
> - Jonas
>
> --
> Jonas Beskow
> Perceptual Science Laboratory || Centre for Speech Technology
> University of Calif, Santa Cruz || KTH
> bes...@fuzzy.ucsc.edu || bes...@speech.kth.se


--
/* Dan Gunter (da...@george.lbl.gov) */

Andreas Kupries

May 26, 1999
Jonas Beskow <bes...@fuzzy.ucsc.edu> writes:

> Greetings,

> We have a large dictionary (3MB, 130K lines), that is to be read
> into an array (one line per element) When read line by line, the
> whole thing takes about 70 seconds (on a 200MHz laptop). I also
> tried using the zipchan extension, to read a zipped version of the

Can you post a URL for this one? I would like to compare it against
my Trf.


> dictionary (840 KB) directly. This shaved off 10 seconds. Now, if
> it would speed things up, it would be OK to do a little
> pre-processing of the dictionary file, i.e. it could be formatted and
> stored as tcl-code, something like:

> array set a {
> key1 val1
> key2 val2
> ...
> }

> or
> set a(key1) val1
> set a(key2) val2
> ...

One piece of advice at

http://purl.org/thecliff/tcl/wiki/TclPerformance

is to 'time' it. So, convert your file to the format under test, then do

puts [time {source yourFile} 1]

to check out the speed.


Remark: One trick to remember for reading large files is to determine
their size beforehand and then to tell 'read' that number, allowing
the command to do a better allocation of channel buffers. So:

set sz [file size yourFile]
set f [open yourFile r]
set data [read $f $sz]
close $f

foreach {key value} [split $data \n] {
    set yourArray($key) $value
}
set data {} ; unset data ; # one of the two commands should
                           # free the associated memory, but I
                           # don't remember which one

--
Sincerely,
Andreas Kupries <a.ku...@westend.com>
<http://www.westend.com/~kupries/>
-------------------------------------------------------------------------------

Dave Warner

May 27, 1999
a.ku...@westend.com wrote:

[snip ...]

>
> Remark: One trick to remember for reading large files is to determine
> their size beforehand and then to tell 'read' that number, allowing
> the command to do a better allocation of channel buffers. So:

[snip ...]

1.
set fd [open "/tmp/X" r]
set data [read $fd]
close $fd

real 0m57.86s
user 0m57.09s
sys 0m0.75s

2.
set size [file size "/tmp/X"]
set fd [open "/tmp/X" r]
set data [read $fd $size]
close $fd

real 0m0.36s
user 0m0.25s
sys 0m0.11s

where /tmp/X: -rw-r--r-- 1 itsadm other 4367393 May 26 20:53 /tmp/X

Maybe this "trick" ic common knowledge but I sure wasn't aware of it; I've
just recoded a smallish app (a log file parser) in C because of the results
in 1. -- I'll now reconsider that drastic act.

c.l.t. strikes again! -- thanks

--
Dave Warner
Lucent Technologies, Inc.
+1-303-538-1748

Kai Harrekilde-Petersen

May 27, 1999
Andreas Kupries <a.ku...@westend.com> writes:

</lurk>

[warning: tcl-newbie mode is set]

> Remark: One trick to remember for reading large files is to determine
> their size beforehand and then to tell 'read' that number, allowing
> the command to do a better allocation of channel buffers. So:
>

> set sz [file size yourFile]
> set f [open yourFile r]
> set data [read $f $sz]
> close $f
>
> foreach {key value} [split $data \n] {
> set yourArray($key) $value
> }
> set data {} ; unset data ; # one of the two commands should
> # free the associated memory, but I don't remember which one

Excellent trick, but this assumes that you can write the code as a simple
foreach loop around the processing. I have a simple parser, searching for
certain keywords, and the current structure of the program doesn't lend itself
easily to that structure (yes, I could rewrite it to make it match, but I'd
like to avoid that for the time being).

How can I read the file (or at least portions of it - the file I'm parsing
could easily be 10-20MBytes, and I've worked on 70MB files before) into a
buffer, and then return a line at a time to the program (ie: effectively
replacing the gets calls)?

Using split to turn it into a list, and then using lindex, would be a solution,
but how efficient would that be? Hmm, this thing would eat memory for
breakfast.
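
One possibility (untested sketch; big.log is a placeholder): slurp and
split the file once, then serve lines from a global index with a small
proc used as a drop-in for gets:

set f [open big.log r]
set lines [split [read $f [file size big.log]] \n]
close $f
set lineNo 0

proc getline {varName} {
    # mimics [gets $f var]: returns the line length, or -1 at EOF
    global lines lineNo
    upvar $varName line
    if {$lineNo >= [llength $lines]} { return -1 }
    set line [lindex $lines $lineNo]
    incr lineNo
    return [string length $line]
}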

<lurk>

TIA,


Kai
--
Kai Harrekilde-Petersen <k...@olicom.dk>
Don't blame my employer for my opinions.

Paul Duffin

May 27, 1999
Dave Warner wrote:
>
> a.ku...@westend.com wrote:
>
> [snip ...]

>
> >
> > Remark: One trick to remember for reading large files is to determine
> > their size beforehand and then to tell 'read' that number, allowing
> > the command to do a better allocation of channel buffers. So:
>
> [snip ...]
>
> 1.
> set fd [open "/tmp/X" r]
> set data [read $fd]
> close $fd
>
> real 0m57.86s
> user 0m57.09s
> sys 0m0.75s
>
> 2.
> set size [file size "/tmp/X"]
> set fd [open "/tmp/X" r]
> set data [read $fd $size]
> close $fd
>
> real 0m0.36s
> user 0m0.25s
> sys 0m0.11s
>
> where /tmp/X: -rw-r--r-- 1 itsadm other 4367393 May 26 20:53 /tmp/X
>
> Maybe this "trick" ic common knowledge but I sure wasn't aware of it; I've
> just recoded a smallish app (a log file parser) in C because of the results
> in 1. -- I'll now reconsider that drastic act.
>
> c.l.t. strikes again! -- thanks
>

Anyone prepared to comment why this should be so significant ?

--
Paul Duffin
DT/6000 Development Email: pdu...@hursley.ibm.com
IBM UK Laboratories Ltd., Hursley Park nr. Winchester
Internal: 7-246880 International: +44 1962-816880

Paul Duffin

May 27, 1999
Andreas Kupries wrote:
>
> Remark: One trick to remember for reading large files is to determine
> their size beforehand and then to tell 'read' that number, allowing
> the command to do a better allocation of channel buffers. So:
>

Is it possible to do better even without passing the size?

> set sz [file size yourFile]
> set f [open yourFile r]
> set data [read $f $sz]
> close $f
>
> foreach {key value} [split $data \n] {
> set yourArray($key) $value
> }
> set data {} ; unset data ; # one of the two commands should
> # free the associated memory, but I don't remember which one
>

Either will.

Jonas Beskow

May 27, 1999
Bryan, thanks for your comments.

Bryan Oakley wrote:
> It's likely your bottleneck is in the reading of the file line by line. try
> using the read command, perhaps followed by split (using the newline as the
> separator).

...


> FWIW, I just wrote a test script that read in 1300000 lines of text
> (weighing in at 3.7 megabytes) and assigns each line to an element in an
> array. Using [read] and [split] was more than twice as fast as using gets.
> So it could be that making that simple change will cut your time in half (at
> the expense of having to allocate enough memory to temporarily hold two
> copies of the data).

I tried this. Oddly enough, read/split is about 50% slower than gets in my case!
As you hinted at, however, source'ing a file with one big array set {....} is by
far the fastest (about 6 times faster than read/split). Here are some benchmarks
I arrived at:

method: gets
108937 elements read in 19125000 us = 5696.05228758 elements/second
method: read and split
108937 elements read in 32875000 us = 3313.6730038 elements/second
method: source and set
107968 elements read in 7515000 us = 14366.9993347 elements/second
method: source and array set
107968 elements read in 5531000 us = 19520.5207015 elements/second
method: source and array set (sorted)
107968 elements read in 5719000 us = 18878.8249694 elements/second

In the above, the first two cases open the file in a standard way using "open"
and the last three use "source".

"source and set" refers to sourcing a file with many lines of the type "set
a(aa) xx".
"source and array set" sources a file with one long statement "array set a {aa
xx bb yy ...}".
The last one is similar, but the elements of the array are sorted alphabetically
in the file (for improved readability).
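
For reference, the pre-processing step itself is only a few lines (a
sketch; file names are placeholders, and [list] takes care of quoting
each key/value pair):

set in  [open cmudict.txt r]
set out [open dict.tcl w]
puts $out "array set a {"
while {[gets $in line] >= 0} {
    # one properly quoted {key value} pair per line
    puts $out [list [lindex $line 0] [lrange $line 1 end]]
}
puts $out "}"
close $in
close $out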

I'm attaching the test script below, so you can see what I'm doing, if you
actually want to run it, the dictionary I'm using is at
ftp://ftp.cs.cmu.edu/project/speech/dict/cmudict.0.6.gz

cheers

[attachment: readtest.tcl]

Kai Harrekilde-Petersen

May 27, 1999
Kai Harrekilde-Petersen <k...@olicom.dk> writes:

> How can I read the file (or at least portions of it - the file I'm parsing
> could easily be 10-20MBytes, and I've worked on 70MB files before) into a
> buffer, and then return a line at the time to the program (ie: effectively
> replacing the gets calls)?
>
> Using split to turn it into a list, and then using lindex would be a solution,
> but how efficient would that be? Hmm, this thing would eat memory or
> breakfast.

I went back and did this. The original version took 6.0 sec, while using
read/split/lindex took 16.3 sec for a 2.4MB file. Not the way to go.


--Kai

Alexandre Ferrieux

May 27, 1999
Paul Duffin wrote:
>
> Dave Warner wrote:
> >
> > a.ku...@westend.com wrote:
> >
> > [snip ...]
> >
> > >
> > > Remark: One trick to remember for reading large files is to determine
> > > their size beforehand and then to tell 'read' that number, allowing
> > > the command to do a better allocation of channel buffers. So:
> >
> > [snip ...]
> >
> > 1.
> > set fd [open "/tmp/X" r]
> > set data [read $fd]
> > close $fd
> >
> > real 0m57.86s
> > user 0m57.09s
> > sys 0m0.75s
> >
> > 2.
> > set size [file size "/tmp/X"]
> > set fd [open "/tmp/X" r]
> > set data [read $fd $size]
> > close $fd
> >
> > real 0m0.36s
> > user 0m0.25s
> > sys 0m0.11s
> >
> > where /tmp/X: -rw-r--r-- 1 itsadm other 4367393 May 26 20:53 /tmp/X
> >
> > Maybe this "trick" ic common knowledge but I sure wasn't aware of it; I've
> > just recoded a smallish app (a log file parser) in C because of the results
> > in 1. -- I'll now reconsider that drastic act.
> >
> > c.l.t. strikes again! -- thanks
> >
>
> Anyone prepared to comment why this should be so significant ?

Well - I guess we are witnessing the two extremities of a spectrum,
namely the quantitative aspects of 'growing a string'.
Again I'm only guessing, but it looks like the single-arg [read] grows
its result just like [append var smallstring] would (i.e. many small
reallocs, ending up in brk()); while clearly a bigger chunk size would be
in order. The question is: is it an oversight, or is there some
intelligent tradeoff here, designed to cope with yet another situation I'm
not thinking of?

In the meantime,

proc readbybigchunks fd {
    set res {}
    while {![eof $fd]} {
        append res [read $fd 2000000] ;# 2 Megs...
    }
    set res
}

seems satisfactory.
By the way, my initial reaction was [read $fd 2Gigs!!!] (without even a
loop in most cases), but this Bus-Errors on my 8.0.5 on Solaris. I guess
(again) the two-arg [read] is also rather naively implemented, in that
it brutally tries to allocate the requested size...

Boys, think of all the intelligence put behind readlines() in Python...

-Alex

Paul Duffin

May 27, 1999

Maybe channels should be able to choose the size of the block that is
used by read. So that if you open a large file you use a large block
size and if you open a smaller file you use a smaller block size. A
socket channel (or a pipe) would obviously not be able to determine
the size of the data but it could have a configurable option to specify
block size.
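
Note that you can already approximate this by hand today (untested
sketch; big.dat is a placeholder):

set f [open big.dat r]
fconfigure $f -buffersize 65536   ;# bigger channel buffers for a big file
set data [read $f [file size big.dat]]
close $f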

What is the size of the block that read uses, is it the buffer size ?

Alexandre Ferrieux

May 27, 1999

Hey, there is! I'm just now realizing that the -buffersize fconfigure
option is also used for input...

> What is the size of the block that read uses, is it the buffer size ?

Nearly. From Tcl_SetChannelBufferSize.3:

    Tcl_SetChannelBufferSize sets the size, in bytes, of buffers
    that will be allocated in subsequent operations on the channel
    to store input or output. The size argument should be between
    ten and one million, allowing buffers of ten bytes to one
    million bytes. If size is outside this range,
    Tcl_SetChannelBufferSize sets the buffer size to 4096.

(BTW, defaulting to 4096 when 1000001 is specified looks like an April
1st joke)

-Alex

Chang LI

May 27, 1999
Kai Harrekilde-Petersen wrote:
>

> >
> > set sz [file size yourFile]
> > set f [open yourFile r]
> > set data [read $f $sz]
> > close $f
> >
> > foreach {key value} [split $data \n] {
> > set yourArray($key) $value
> > }

It is better written as

set t [split $data \n]
foreach {key value} $t {
    set yourArray($key) $value
}

> > set data {} ; unset data ; # one of the two commands should
> > # free the associated memory, but I don't remember which one
>

--

Bryan Oakley

May 27, 1999
Chang LI wrote:

>
> Kai Harrekilde-Petersen wrote:
> > > foreach {key value} [split $data \n] {
> > > set yourArray($key) $value
> > > }
>
> It is better to be
>
> set t [split $data \n]
> foreach {key value} $t {
> set yourArray($key) $value
> }

Why is it better? It actually looks to be a tad bit slower since there
is an extra assignment that really doesn't buy you anything in this
scenario. Or am I missing something fundamental?

Jonas Beskow

May 27, 1999

Andreas Kupries wrote:
>
> Jonas Beskow <bes...@fuzzy.ucsc.edu> writes:
>
> > Greetings,
>
> > We have a large dictionary (3MB, 130K lines), that is to be read
> > into an array (one line per element) When read line by line, the
> > whole thing takes about 70 seconds (on a 200MHz laptop). I also
> > tried using the zipchan extension, to read a zipped version of the
>
> Can you post an url for this one ? I would like to compare it against
> my Trf.

The zipchan I used is part of the CSLU Speech Toolkit
http://cslu.cse.ogi.edu/toolkit
You can download the entire toolkit, 36 MB; source code is not included at
present, but will be eventually. I just corresponded with the author and he
referred to the zipchan package as presently falling into the "quick hack
category".

FYI I did an AltaVista search on zipchan and came up with
http://nestroy.wi-inf.uni-essen.de/wafe/ that contains another zipchan that I
haven't tried.

regards

Tom Poindexter

May 27, 1999
Just to chime in with my 0.02: have you considered a database
approach? Instead of storing key-value pairs in an array, try using
a simple DBM-type database with Tcl interface. Pre-build your dbm
files, then use the dbm access to look up values.
pluses: zero startup overhead
minuses: additional time to look up each value.

Your accesses will probably be in the millisecond range, rather than
microsecond range as with using an array. This may or may not be
fast enough depending on your application.
You can always cache values back into an array for subsequent accesses.
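
The caching wrapper is only a few lines (a sketch; dbmfetch is a
hypothetical stand-in for whatever lookup command your DBM extension
actually provides):

proc lookup {key} {
    global cache
    if {![info exists cache($key)]} {
        # dbmfetch is a placeholder for the real DBM lookup command
        set cache($key) [dbmfetch $key]
    }
    return $cache($key)
}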

Several suitable DBM interfaces exist; check the FAQ.

Lacking that, I'd recommend the approach to format your data file as
a Tcl 'array set' command. Easy enough to create, a simple 'source'
command loads the data.
--
Tom Poindexter
tpoi...@nyx.net
http://www.nyx.net/~tpoindex/

Jonas Beskow

May 27, 1999
Just as a side note, I ran the same script under 8.1, with the following results:
* The gets and read-based methods are slower in 8.1 than in 8.0.
* The read/split-based method is faster than the gets-based version under 8.1
  (but still slower than under 8.0, see above).
* The source-based methods are *faster* in 8.1 than in 8.0.
  In 8.1 source/array set is 10X faster than gets (5X under 8.0).

Here are the numbers:

Tcl 8.0.2
---------
method: gets
108937 elements read in 23516000 us = 4632.46300391 elements/second
method: read and split
108937 elements read in 36000000 us = 3026.02777778 elements/second
method: source and set
107968 elements read in 7469000 us = 14455.4826617 elements/second
method: source and array set
107968 elements read in 5515000 us = 19577.1532185 elements/second
method: source and array set (sorted)
108937 elements read in 5797000 us = 18791.9613593 elements/second

Tcl 8.1.1
---------
method: gets
108937 elements read in 43781000 us = 2488.22548594 elements/second
method: read and split
108937 elements read in 38765000 us = 2810.18960402 elements/second
method: source and set
107968 elements read in 4250000 us = 25404.2352941 elements/second
method: source and array set
107968 elements read in 4204000 us = 25682.2074215 elements/second
method: source and array set (sorted)
108937 elements read in 4422000 us = 24635.2329263 elements/second


The above was executed on a dual 400MHz PII with 512 MB RAM running WinNT.

Regards
- Jonas


Chang LI

May 27, 1999
Tom Poindexter wrote:
>
> Just to chime in with my 0.02: have you considered a database
> approach? Instead of storing key-value pairs in an array, try using
> a simple DBM-type database with Tcl interface. Pre-build your dbm
> files, then use the dbm access to look up values.
> pluses: zero startup overhead
> minuses: additional time to lookup each value.
>

It is good to have a DBMS, but sometimes the task is too simple to
use a database. I am just wondering why the performance of an array
cannot achieve the level of a DBMS.

> Your accesses will probably be in the millisecond range, rather than
> microsecond range as with using an array. This may or may not be
> fast enough depending on your application.
> You can always cache values back into an array for subsequent accesses.
>
> Several suitable DBM interfaces exist; check the FAQ.
>
> Lacking that, I'd recommend the approach to format your data file as
> a Tcl 'array set' command. Easy enough to create, a simple 'source'
> command loads the data.
> --
> Tom Poindexter
> tpoi...@nyx.net
> http://www.nyx.net/~tpoindex/

--

Chang LI

May 27, 1999
Bryan Oakley wrote:
>
> Chang LI wrote:
> >
> > Kai Harrekilde-Petersen wrote:
> > > > foreach {key value} [split $data \n] {
> > > > set yourArray($key) $value
> > > > }
> >
> > It is better to be
> >
> > set t [split $data \n]
> > foreach {key value} $t {
> > set yourArray($key) $value
> > }
>

You are right. There is very little difference in speed.
But the latter costs more memory. I have to say I posted too quickly.

> Why is it better? It actually looks to be a tad bit slower since there
> is an extra assignment that really doesn't buy you anything in this
> scenario. Or am I missing something fundamental?

--

Jonas Beskow

May 27, 1999

Chang LI wrote:
>
> Tom Poindexter wrote:
> >
> > Just to chime in with my 0.02: have you considered a database
> > approach? Instead of storing key-value pairs in an array, try using
> > a simple DBM-type database with Tcl interface. Pre-build your dbm
> > files, then use the dbm access to look up values.
> > pluses: zero startup overhead
> > minuses: additional time to lookup each value.
> >
>
> That is good to have a DBMS. But sometimes the task is too simple to
> use a database.

This is indeed the case in my situation; a database sounds like overkill... also
I'm doing thousands of lookups at a time, so access time is an issue.

> > Lacking that, I'd recommend the approach to format your data file as
> > a Tcl 'array set' command. Easy enough to create, a simple 'source'
> > command loads the data.

This is what I ended up doing, and it works great. WAY faster than gets or read
(see previous posts).

Paul Duffin

May 28, 1999
Bryan Oakley wrote:
>
> Chang LI wrote:
> >
> > Kai Harrekilde-Petersen wrote:
> > > > foreach {key value} [split $data \n] {
> > > > set yourArray($key) $value
> > > > }
> >
> > It is better to be
> >
> > set t [split $data \n]
> > foreach {key value} $t {
> > set yourArray($key) $value
> > }
>
> Why is it better? It actually looks to be a tad bit slower since there
> is an extra assignment that really doesn't buy you anything in this
> scenario. Or am I missing something fundamental?

It would be better in terms of memory usage if he added [unset data]
between the assignment to t and the loop and [unset t] after the
loop. Otherwise by the time you have finished the loop you have three
copies of the data.
1. $data
2. $t
3. yourArray. (a copy of the key, a reference to the value).
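
I.e. something like this (sketch):

set t [split $data \n]
unset data                  ;# copy 1 gone before the loop runs
foreach {key value} $t {
    set yourArray($key) $value
}
unset t                     ;# copy 2 gone; only the array remains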

Andreas Kupries

May 29, 1999
Paul Duffin <pdu...@mailserver.hursley.ibm.com> writes:

> Bryan Oakley wrote:
> >
> > Chang LI wrote:
> > >
> > > Kai Harrekilde-Petersen wrote:
> > > > > foreach {key value} [split $data \n] {
> > > > > set yourArray($key) $value
> > > > > }
> > >
> > > It is better to be
> > >
> > > set t [split $data \n]
> > > foreach {key value} $t {
> > > set yourArray($key) $value
> > > }
> >
> > Why is it better? It actually looks to be a tad bit slower since there
> > is an extra assignment that really doesn't buy you anything in this
> > scenario. Or am I missing something fundamental?
>
> It would be better in terms of memory usage if he added [unset data]
> between the assignment to t and the loop and [unset t] after the
> loop. Otherwise by the time you have finished the loop you have three
> copies of the data.
> 1. $data
> 2. $t
> 3. yourArray. (a copy of the key, a reference to the value).

Uggh. I missed that. Thanks.

Alternative, using Donal's K proc (proc K {x y} {set x}):

foreach {key value} [split [K $data [unset data]] \n] {
    set yourArray($key) $value
}

Andreas Kupries

May 29, 1999

Correct, see below.

> > reallocs, ending up in brk()); while clearly a bigger chunksize would be
> > in order. The question is: is it an overlook or is there some
> > intelligent tradeoff here, designed to cope to yet another situation I'm
> > not thinking of ?
> >
> > In the meantime,
> >
> > proc readbybigchunks fd {
> > set res {}
> > while {![eof $fd]} {
> > append res [read $fd 2000000] ;# 2 Megs...
> > }
> > set res
> > }
> >
> > seems satisfactory.

> > By the way, my initial reaction was [read $fd 2Gigs!!!] (without even a
> > loop in most cases), but this Bus-Errors on my 8.0.5 on Solaris. I guess
> > (again) the two-arg [read] is also rather naively implemented, in that
> > it brutally tries to allocate the requested size...

Correct, see below.

> Maybe channels should be able to choose the size of the block that is
> used by read. So that if you open a large file you use a large block
> size and if you open a smaller file you use a smaller block size. A
> socket channel (or a pipe) would obviously not be able to determine
> the size of the data but it could have a configurable option to specify
> block size.

> What is the size of the block that read uses, is it the buffer size ?

Yes.


From generic/tclIOCmd.c (Tcl_ReadObjCmd):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    bufSize = Tcl_GetChannelBufferSize(chan);

    /*
     * If the caller specified a maximum length to read, then that is
     * a good size to preallocate.
     */

    if ((toRead != INT_MAX) && (toRead > bufSize)) {
        Tcl_SetObjLength(resultPtr, toRead);
    }

    for (charactersRead = 0; charactersRead < toRead; ) {
        toReadNow = toRead - charactersRead;
        if (toReadNow > bufSize) {
            toReadNow = bufSize;
        }

        /*
         * NOTE: This is a NOOP if we set the size (above) to the
         * number of bytes we expect to read. In the degenerate
         * case, however, it will grow the buffer by the channel
         * buffersize, which is 4K in most cases. This will result
         * in inefficient copying for large files. This will be
         * fixed in a future release.
         */

        Tcl_SetObjLength(resultPtr, charactersRead + toReadNow);

        charactersReadNow =
            Tcl_Read(chan, Tcl_GetStringFromObj(resultPtr, NULL)
                    + charactersRead, toReadNow);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Andreas Kupries

May 29, 1999

Jonas Beskow <bes...@fuzzy.ucsc.edu> writes:
> Andreas Kupries wrote:
> >
> > Jonas Beskow <bes...@fuzzy.ucsc.edu> writes:

> > > Greetings,
> >
> > > We have a large dictionary (3MB, 130K lines), that is to be read
> > > into an array (one line per element) When read line by line, the
> > > whole thing takes about 70 seconds (on a 200MHz laptop). I also
> > > tried using the zipchan extension, to read a zipped version of the
> >
> > Can you post an url for this one ? I would like to compare it against
> > my Trf.

I should say that Trf is at
http://www.oche.de/~akupries/soft/trf/index.html

and contains a 'zip' command. 'zip'ping channels is possible too, but
only if the core is patched (patch is distributed with the package,
for various versions of Tcl). Trf basically generalizes 'zipchan' and
similar code into a generic filtering framework for channels. I still
hope for the inclusion of the necessary patch into the core.

> The zipchan I used is part of the CSLU Speech Toolkit
> http://cslu.cse.ogi.edu/toolkit You can download the entire toolkit,
> 36 MB

Ah, no. This is a little too big for my small modem connection
(max. speed between 3-4 K/sec, if lucky), even if done via cron during
the night (lowest phone costs here in Germany). I'll take a look at
that page nevertheless.

> sourcecode is not included at present, but will be eventually.

I add my hope to yours.

> I just corresponded with the author and he refered to the zipchan
> package as presently falling into the "quick hack category".

I see.


> FYI I did an AltaVista search on zipchan and came up with
> http://nestroy.wi-inf.uni-essen.de/wafe/ that contains another zipchan that I
> haven't tried.

Did the same later on and found that one too.

William Donovan

May 30, 1999
Is it possible to use mmap to back a tcl array?
I have been in need not only of speed but also
of persistence for the array. Using mmap might
provide both. Am I crazy?
Bill Donovan


Alexandre Ferrieux

May 31, 1999

If all you want is a mmaped hash, what about MetaKit ?

-Alex

Andreas Kupries

May 31, 1999

Alexandre Ferrieux <alexandre...@cnet.francetelecom.fr> writes:
> Paul Duffin wrote:

>> Maybe channels should be able to choose the size of the block that
>> is used by read. So that if you open a large file you use a large
>> block size and if you open a smaller file you use a smaller block
>> size. A socket channel (or a pipe) would obviously not be able to
>> determine the size of the data but it could have a configurable
>> option to specify block size.

> Hey, there is ! I'm just now realizing that the -buffersize
> fconfigure option is also used for input...

>> What is the size of the block that read uses, is it the buffer size
>> ?

> Nearly. From Tcl_SetChannelBufferSize.3:

>     Tcl_SetChannelBufferSize sets the size, in bytes, of buffers
>     that will be allocated in subsequent operations on the channel
>     to store input or output. The size argument should be between
>     ten and one million, allowing buffers of ten bytes to one
>     million bytes. If size is outside this range,
>     Tcl_SetChannelBufferSize sets the buffer size to 4096.

> (BTW, defaulting to 4096 when 1000001 is specified looks like an
> April 1st joke)

What I am interested in is why the minimum size is restricted to 10
bytes instead of 1.

This choice looks a bit arbitrary to me.

WANGNICK Sebastian

Jun 1, 1999
Andreas Kupries wrote:

> Alexandre Ferrieux <alexandre...@cnet.francetelecom.fr> writes:
> >     Tcl_SetChannelBufferSize sets the size, in bytes, of buffers
> >     that will be allocated in subsequent operations on the channel
> >     to store input or output. The size argument should be between
> >     ten and one million, allowing buffers of ten bytes to one
> >     million bytes. If size is outside this range,
> >     Tcl_SetChannelBufferSize sets the buffer size to 4096.
>
> What I am interested in, why is the minimum size restricted to 10
> bytes instead of 1 ?

Or even 0 for binary-translated channels, so that you can properly mix
your read/gets with exec/open| <@ (which is impossible for the moment)!
--
Dipl.-Inform. Sebastian <dot> Wangnick <at eurocontrol in be>
Office: Eurocontrol Maastricht UAC, Horsterweg 11, NL-6191RX Beek,
Tel: +31-433661370, Fax: ~300
Spam email is reported (charge $100) to providers and U...@FTC.GOV.

Alexandre Ferrieux

Jun 1, 1999
WANGNICK Sebastian wrote:
>
> Andreas Kupries wrote:
> > Alexandre Ferrieux <alexandre...@cnet.francetelecom.fr> writes:
> > >     Tcl_SetChannelBufferSize sets the size, in bytes, of buffers
> > >     that will be allocated in subsequent operations on the channel
> > >     to store input or output. The size argument should be between
> > >     ten and one million, allowing buffers of ten bytes to one
> > >     million bytes. If size is outside this range,
> > >     Tcl_SetChannelBufferSize sets the buffer size to 4096.
> >
> > What I am interested in, why is the minimum size restricted to 10
> > bytes instead of 1 ?
>
> Or even 0 for binary-translated channels, so that you can properly mix
> your read/gets with exec/open| <@ (which is impossible for the moment)!

Yes. The unbuffered-input quest. Please notice that if the feature comes
into existence someday, an even more natural way of requesting it would
be '-buffering none' (instead of '-buffersize 0').

Notice also that there's still an unwanted symmetry between reading and
writing here: for bidirectional channels, I believe it would be VERY
cool to be able to set the buffering behavior independently on the r and
w side. Same for [close]. As soon as Jeff lands in Pacific Time, I'll
bug him again to ask for news about this 'half-close' request.

-Alex

Cary O'Brien

Jun 1, 1999
In article <374C4F92...@fuzzy.ucsc.edu>,

Jonas Beskow <bes...@fuzzy.ucsc.edu> wrote:
>Greetings,
>
>We have a large dictionary (3MB, 130K lines), that is to be read into an array
>(one line per element) When read line by line, the whole thing takes about 70
>seconds (on a 200MHz laptop). I also tried using the zipchan extension, to read
>a zipped version of the dictionary (840 KB) directly. This shaved off 10
>seconds.
>Now, if it would speed things up, it would be OK to do a little pre-processing
>of the dictionary file, i.e. it could be formatted and stored as tcl-code,
>something like:
>
>array set a {
> key1 val1
> key2 val2
> ...
>}
>
>or
>set a(key1) val1
>set a(key2) val2
>...
>
>or something completely different. Also, would it be significantly faster to
>write a C-extension that just reads lines from a file and builds up the array
>(using Tcl_SetVar2, effectively bypassing the interpreter) or is this overkill?
>
>If there is no (significantly) faster way of doing it, then it's fine. I just
>want to be sure we're doing the optimal thing. We're using tcl 8.02 on Windows
>

How about array set/get and read/write_file from tclx?

# dump the file
write_file fred.dat [array get fred]

# read it back
array set fred [read_file fred.dat]

Must be better than all those sets, eh?

-- cary



Bryan Oakley

Jun 1, 1999
William Donovan wrote:
>
> Is it possible to use mmap to back a tcl array?
> I have been in need not only of speed but also
> of persistence for the array. Using mmap might
> provide both. Am I crazy?
> Bill Donovan

I would think it would be quite possible. Of course, you'll have to
write some C code, and the code won't be particularly portable. But it
should be possible.

--
Bryan Oakley mailto:oak...@channelpoint.com
ChannelPoint, Inc. http://purl.oclc.org/net/oakley

Donal K. Fellows

Jun 4, 1999
In article <374CF98B...@fuzzy.ucsc.edu>,

Jonas Beskow <bes...@fuzzy.ucsc.edu> wrote:
> I tried this. Oddly enough read/split is about 50% slower than gets
> in my case! As you hinted at, however, source'ing a file with one
> big array set {....} is by far the fastest (about 6 times faster
> than read/split), here are some benchmarks I arrived at:

The fastest method of all I find to be a simple array dump/undump:

proc dumpArray {aryName filename} {
    upvar $aryName ary
    set f [open $filename w]
    puts $f [array get ary]
    close $f
}

proc undumpArray {aryName filename} {
    upvar $aryName ary
    set f [open $filename r]
    # The [file size] trick makes a big difference
    array set ary [read $f [file size $filename]]
    close $f
}

This is faster (though less flexible) than just sourcing a script, as
it doesn't ram several megs of data through the compiler/interpreter,
and it keeps the reference counts really low so that the number of
copy operations is as small as possible. The dumping operation is
even faster.

(Not posting the timing results, since they are affected by other
factors which make them non-comparable. <sigh>)

Donal.
--
Donal K. Fellows http://www.cs.man.ac.uk/~fellowsd/ fell...@cs.man.ac.uk
-- The small advantage of not having California being part of my country would
be overweighed by having California as a heavily-armed rabid weasel on our
borders. -- David Parsons <o r c @ p e l l . p o r t l a n d . o r . u s>

Darren New

Jun 4, 1999
Donal K. Fellows wrote:
> # The [file size] trick makes a big difference

Why not incorporate that change into the core, then?

--
Darren New / Senior Software Architect / MessageMedia, Inc.
San Diego, CA, USA (PST). Cryptokeys on demand.
Help outlaw Dihydrogen monoxide, a major component of acid rain!

Pascal Bouvier

Jun 7, 1999
"Donal K. Fellows" wrote:
>
> In article <374CF98B...@fuzzy.ucsc.edu>,
> Jonas Beskow <bes...@fuzzy.ucsc.edu> wrote:
> > I tried this. Oddly enough read/split is about 50% slower than gets
> > in my case! As you hinted at, however, source'ing a file with one
> > big array set {....} is by far the fastest (about 6 times faster
> > than read/split), here are some benchmarks I arrived at:
>
> The fastest method of all I find to be a simple array dump/undump:
>
> proc dumpArray {aryName filename} {
> upvar $aryName ary
> set f [open $filename w]
> puts $f [array get ary]
> close $f
> }
>
> proc undumpArray {aryName filename} {
> upvar $aryName ary
> set f [open $filename r]
> # The [file size] trick makes a big difference
> array set ary [read $f [file size $filename]]
> close $f
> }
>
> This is faster (though less flexible) than just sourcing a script as
> it doesn't ram several megs of data through the compiler/interpreter
> while keeping the reference counts down really low to keep the number
> of copy operations as small as possible. The dumping operation is
> even faster.
>
> (Not posting the timing results, since they are affected by other
> factors which make them non-comparable. <sigh>)
>
> Donal.
> --

I would add that putting a "fconfigure -buffersize" statement with a big size
in proc "undumpArray", between "open" and "read", should greatly increase speed
(untested/untimed):
fconfigure $f -buffersize 1000000
(unless this is automagically done by the presence of the size argument of "read")

--
Pascal

Chang LI

Jun 7, 1999
Pascal Bouvier wrote:
>

One problem with allocating a big buffer is that it can cause a "not enough
memory" error: you may have enough memory in total, but the largest single
free block may be too small.

> I would add that putting a "fconfigure -buffersize" statement with a big size
> in proc "undumpArray", between "open" and "read" should greatly increase speed
> (untested/untimed):
> fconfigure $f -buffersize 1000000
> (unless this is automagically done with the presence of the size argument of "read")
>
> --
> Pascal

--

sandh...@my-deja.com

Jun 18, 1999
A technique I have found useful to improve load times on large files is to
concatenate multiple lines before doing the read, for example:

cat largefile | paste - - - - | reader_proc.tcl

The "reader_proc.tcl" would expect multi-line reads, with each "row"
separated by tabs. I've hit the wall on some of the "commercial" Unixes,
with Solaris dying at a maximum of twelve args to the paste command. I've
been able to push AIX 4.2.1 as high as 128 args to the paste command, which
results in concatenating 128 "rows" that are then passed to a single read
(by the reader_proc.tcl program).
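
The reading side then looks something like this (untested sketch; it
assumes tab-separated rows as produced by paste):

while {[gets stdin chunk] >= 0} {
    # each physical line carries several original rows, tab-separated
    foreach row [split $chunk \t] {
        # process one original row here
    }
}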

In article <m3u2szf...@bluepeak.westend.com>,


Andreas Kupries <a.ku...@westend.com> wrote:

[snip ...]

