On Apr 19, 1996 17:40:14 in article <How to read lots of data from disk
quickly?>, 'Bruce L. Lambert, Ph.D. <lambe...@uic.edu>' wrote:
>In the course of my information retrieval experiments, I often need to
>read large data files in from disk. I've got a couple of functions that
>do this successfully, but they seem awful damn slow to me. They also seem
>really convoluted for what should be a more straightforward task. Of
>course, I am a graduate of the Rube Goldberg school of programming. I
>never saw a superfluous step I didn't like.
>In any case, here's a function to read in the indexes and values from a
>sparse array into a hash table. Is there a more efficient way of going
>about this.
>(defun read-nonzero-lower-diagonal (fname)
>(print "Loading similarity matrix...")
>(let ((row-index 0)
>(column-index 0)
>(current-sim-value 0)
>(sim-table (make-hash-table :test #'equal)))
>(with-open-file (istream fname :direction :input)
>(loop
>(when (null (peek-char t istream nil nil)) (return))
>(setf row-index (read istream nil nil))
>(setf column-index (read istream nil nil))
>(setf current-sim-value (read istream nil nil))
>(setf (gethash (cons row-index column-index) sim-table)
>current-sim-value)))
>(print "Done loading similarity matrix")
>sim-table))
The answer depends on how much control you have over matters,
and how hard you are willing to work to speed things up.
For example, a significant amount of time is expended in
extracting numerical data from ASCII character strings. If
you could store the numbers in binary form, much time could
be saved by not doing the conversion as well as eliminating
the inherent inefficiencies in general-purpose routines such
as READ.
If external constraints dictate text mode data, could you
at least write it in formatted; i.e., column-sensitive form?
If so, you could save some time by using READ-LINE and your
own custom integer extractors that convert text into numeric
from specified columns. I have written some that I could share.
An example of using one such would be:
(loop as data-line = (read-line infile nil nil)
while data-line
as row = (extract-integer data-line 0 6) ;; cols 1-6
as col = (extract-integer data-line 7 13) ;; cols 8-13
as val = (extract-integer data-line 14 20)
do
(setf (gethash (cons row col) sim-table) val))
If you have no control over any of this, there's not much
you can do. (Somebody else's suggestion of eliminating
the intermediate variables will not make a noticeable
difference). You could easily eliminate the PEEK-CHAR call
but its effect is implementation-dependent and probably
will not make a difference worth beans.
- - - - -
>This next one is supposed to read each line from a text file and return
>it as a list of symbols (not as a string). It seems almost unbelievably
>convoluted to me. Is there a better way?
>(defun read-collection (filename)
>(with-open-file (istream filename :direction :input)
>(let ((collection (make-array 0 :adjustable t :fill-pointer t))
>(current-clause '()))
>(loop
>(cond ((null (peek-char nil istream nil nil)) (return))
>;;when next char is end of file return
>((char= (peek-char nil istream) #\newline)
>;;when next char is newline
>(read-char istream)
>;;pluck char off of istream
>(vector-push-extend current-clause collection)
>;;put the current clause on the end of the collection
>(setf current-clause '()))
>;;clear the contents of current clause
>(t (loop
>(if (or (null (peek-char nil istream nil nil))
>;;if next char is end of file
>(char= (peek-char nil istream) #\newline))
>;;or newline
>(return)
>;;return
>;;else push the next word onto the end of the
>current clause
>(push-end (read-preserving-whitespace istream
>current-clause))))))
>collection)))
>The previous function uses a macro called push-end that a friend of mine
>wrote. I's mighty convenient.
>(defmacro push-end (object list-variable)
>`(setf ,list-variable
>(cond ((consp ,list-variable)
>(nconc ,list-variable (list ,object)))
>((null ,list-variable) (list ,object))
>(t (format t "ERROR -- PUSH-END was given invalid
>arguments")))))
Now here, a buch of improvements can be obtained; and,
the 'convolution' can definitely be diminshed.
First, if you have some idea of how big the array is
going to be, make it that size and save the overhead of
multiple GROW-ARRAY calls that are being made 'behind the
scenes'. If the range of values is great, waste some
space and initialize it to some reasonable size; e.g.,
1000. Alternately, and my favorite technique, is to
initially gather a list and then make an array out of
it -- see below for example.
Second, the PEEK-CHAR issue surfaces itself again. This
time, however, it's much more significant in that it
potentially gets called many, many times. The best (?
best is in the eyes of the beholder) method is to read
the entire line and process it in a more efficient manner.
This eliminates all of the peeking.
Third, although the PUSH-END works, it's potentially
somewhat inefficient. Each time you push an item to
the end of the list, you must traverse the list. If
your lists are lengthy, this adds up to a fair amount of
time. Better to build the list in reverse order, then
unreverse it just before storing into the array.
Here's how I would do it. The idea, as partially
outlined above, is to read a line at a time, creating
the list from the text line, then collecting a list
of lists. When done, build an array from the list.
It wastes a bit of cons space, but is potentially
a lot faster. Further improvements are also
possible, but at diminishing returns for the effort.
For example, I have a special read-line-into-reusable-
string that is noticeably faster, but only with large
files.
(defun build-array (&optional (filename "e:\\data\\foo.bar"))
(let ((list-of-lists nil)
(len 0)
arr)
(with-open-file (f filename)
(loop as line = (read-line f nil nil)
while line
as wordlist = (make-wordlist line)
do
(push wordlist list-of-lists)
(incf len)))
;; you could eliminate the len var and just
;; take the length of the list instead.
(setf arr (make-array len))
(loop for wlist in list-of-lists
for i from (- len 1) by -1
do
(setf (aref arr i) wlist))
arr) ;; return the array to caller
(defun make-wordlist (text)
(loop with pos = 0
and list = nil
do
(multiple-value-bind (word npos)
(read-from-string text nil nil :start pos)
(unless word
(return-from make-wordlist (nreverse list)))
(push word list)
(setf pos npos))))
Note: Tested on Allegro CL 3.0 on WinNT 3.51.
--
Pete Grant
Kalevi, Inc.
Software Engineering & development