
How to read lots of data from disk quickly?


Bruce L. Lambert, Ph.D.

Apr 19, 1996
In the course of my information retrieval experiments, I often need to
read large data files in from disk. I've got a couple of functions that
do this successfully, but they seem awful damn slow to me. They also seem
really convoluted for what should be a more straightforward task. Of
course, I am a graduate of the Rube Goldberg school of programming. I
never saw a superfluous step I didn't like.

In any case, here's a function to read in the indexes and values from a
sparse array into a hash table. Is there a more efficient way of going
about this?

(defun read-nonzero-lower-diagonal (fname)
  (print "Loading similarity matrix...")
  (let ((row-index 0)
        (column-index 0)
        (current-sim-value 0)
        (sim-table (make-hash-table :test #'equal)))
    (with-open-file (istream fname :direction :input)
      (loop
        (when (null (peek-char t istream nil nil)) (return))
        (setf row-index (read istream nil nil))
        (setf column-index (read istream nil nil))
        (setf current-sim-value (read istream nil nil))
        (setf (gethash (cons row-index column-index) sim-table)
              current-sim-value)))
    (print "Done loading similarity matrix")
    sim-table))

This next one is supposed to read each line from a text file and return
it as a list of symbols (not as a string). It seems almost unbelievably
convoluted to me. Is there a better way?

(defun read-collection (filename)
  (with-open-file (istream filename :direction :input)
    (let ((collection (make-array 0 :adjustable t :fill-pointer t))
          (current-clause '()))
      (loop
        (cond ((null (peek-char nil istream nil nil))
               ;; when next char is end of file, return
               (return))
              ((char= (peek-char nil istream) #\newline)
               ;; when next char is newline,
               ;; pluck char off of istream
               (read-char istream)
               ;; put the current clause on the end of the collection
               (vector-push-extend current-clause collection)
               ;; clear the contents of current clause
               (setf current-clause '()))
              (t (loop
                   (if (or (null (peek-char nil istream nil nil))
                           ;; if next char is end of file or newline,
                           (char= (peek-char nil istream) #\newline))
                       ;; return
                       (return)
                       ;; else push the next word onto the end of the
                       ;; current clause
                       (push-end (read-preserving-whitespace istream)
                                 current-clause))))))
      collection)))

The previous function uses a macro called push-end that a friend of mine
wrote. It's mighty convenient.

(defmacro push-end (object list-variable)
  `(setf ,list-variable
         (cond ((consp ,list-variable)
                (nconc ,list-variable (list ,object)))
               ((null ,list-variable) (list ,object))
               (t (format t "ERROR -- PUSH-END was given invalid arguments")))))

Benjamin Shults

Apr 19, 1996
Bruce L. Lambert, Ph.D. wrote:
>
> In the course of my information retrieval experiments, I often need to
> read large data files in from disk. I've got a couple of functions that
> do this successfully, but they seem awful damn slow to me. They also seem
> really convoluted for what should be a more straightforward task. Of
> course, I am a graduate of the Rube Goldberg school of programming. I
> never saw a superfluous step I didn't like.

This should make some difference on the first one:

(defun read-nonzero-lower-diagonal (fname)
  (print "Loading similarity matrix...")
  (let ((sim-table (make-hash-table :test #'equal)))
    (with-open-file (istream fname :direction :input)
      (loop
        (when (null (peek-char t istream nil nil)) (return))
        (setf (gethash (cons (read istream nil nil) (read istream nil nil))
                       sim-table)
              (read istream nil nil))))
    (print "Done loading similarity matrix")
    sim-table))

On the second one, I would probably use READ-LINE to get a string, then
READ-FROM-STRING. I'm not sure that would be faster, but it would be less
convoluted.
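Something along these lines, perhaps (a rough sketch of the idea;
`line->symbols' is a made-up name):

```lisp
;; READ-LINE + READ-FROM-STRING: pull each line in as one string,
;; then let READ-FROM-STRING walk it, collecting the objects.
(defun line->symbols (line)
  (loop with pos = 0
        for (obj npos) = (multiple-value-list
                          (read-from-string line nil nil :start pos))
        while obj
        do (setf pos npos)
        collect obj))
```

Each line of the file then becomes one list, with no per-character peeking.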

It has seemed to me that Lisp's file reading functions are slow. This
probably depends on the Lisp implementation. If speed is important, you
might try another implementation.

--
Benjamin Shults Email: bsh...@math.utexas.edu
Department of Mathematics Phone: (512) 471-7711 ext. 208
University of Texas at Austin WWW: http://www.ma.utexas.edu/users/bshults
Austin, TX 78712 USA FAX: (512) 471-9038 (attn: Benjamin Shults)

Pete Grant

Apr 20, 1996
On Apr 19, 1996 17:40:14 in article <How to read lots of data from disk
quickly?>, 'Bruce L. Lambert, Ph.D. <lamb...@uic.edu>' wrote:


>In the course of my information retrieval experiments, I often need to
>read large data files in from disk. I've got a couple of functions that
>do this successfully, but they seem awful damn slow to me. They also seem
>really convoluted for what should be a more straightforward task. Of
>course, I am a graduate of the Rube Goldberg school of programming. I
>never saw a superfluous step I didn't like.
>
>In any case, here's a function to read in the indexes and values from a
>sparse array into a hash table. Is there a more efficient way of going
>about this.
>
>(defun read-nonzero-lower-diagonal (fname)
>  (print "Loading similarity matrix...")
>  (let ((row-index 0)
>        (column-index 0)
>        (current-sim-value 0)
>        (sim-table (make-hash-table :test #'equal)))
>    (with-open-file (istream fname :direction :input)
>      (loop
>        (when (null (peek-char t istream nil nil)) (return))
>        (setf row-index (read istream nil nil))
>        (setf column-index (read istream nil nil))
>        (setf current-sim-value (read istream nil nil))
>        (setf (gethash (cons row-index column-index) sim-table)
>              current-sim-value)))
>    (print "Done loading similarity matrix")
>    sim-table))
>

The answer depends on how much control you have over matters,
and how hard you are willing to work to speed things up.
For example, a significant amount of time is expended in
extracting numerical data from ASCII character strings. If
you could store the numbers in binary form, much time could
be saved by not doing the conversion as well as eliminating
the inherent inefficiencies in general-purpose routines such
as READ.

If external constraints dictate text mode data, could you
at least write it in formatted; i.e., column-sensitive form?
If so, you could save some time by using READ-LINE and your
own custom integer extractors that convert text into numeric
from specified columns. I have written some that I could share.
An example of using one such would be:
(loop as data-line = (read-line infile nil nil)
      while data-line
      as row = (extract-integer data-line 0 6)   ;; cols 1-6
      as col = (extract-integer data-line 7 13)  ;; cols 8-13
      as val = (extract-integer data-line 14 20) ;; cols 15-20
      do
      (setf (gethash (cons row col) sim-table) val))
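A minimal version of such an extractor can be built on PARSE-INTEGER (a
sketch only; the actual extractors mentioned above may do more):

```lisp
;; Hypothetical minimal EXTRACT-INTEGER: parse the digits found in
;; TEXT between START (inclusive) and END (exclusive), ignoring any
;; surrounding blanks; returns NIL if the field is empty.
(defun extract-integer (text start end)
  (parse-integer text :start start
                      :end (min end (length text))
                      :junk-allowed t))
```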

If you have no control over any of this, there's not much
you can do. (Somebody else's suggestion of eliminating
the intermediate variables will not make a noticeable
difference). You could easily eliminate the PEEK-CHAR call
but its effect is implementation-dependent and probably
will not make a difference worth beans.

- - - - -
Now here, a bunch of improvements can be obtained; and,
the 'convolution' can definitely be diminished.

First, if you have some idea of how big the array is
going to be, make it that size and save the overhead of
multiple GROW-ARRAY calls that are being made 'behind the
scenes'. If the range of values is great, waste some
space and initialize it to some reasonable size; e.g.,
1000. Alternately, and my favorite technique, is to
initially gather a list and then make an array out of
it -- see below for example.

Second, the PEEK-CHAR issue surfaces again. This
time, however, it's much more significant in that it
potentially gets called many, many times. The best (?
best is in the eyes of the beholder) method is to read
the entire line and process it in a more efficient manner.
This eliminates all of the peeking.

Third, although PUSH-END works, it's potentially
somewhat inefficient. Each time you push an item to
the end of the list, you must traverse the list. If
your lists are lengthy, this adds up to a fair amount of
time. Better to build the list in reverse order, then
unreverse it just before storing into the array.

Here's how I would do it. The idea, as partially
outlined above, is to read a line at a time, creating
the list from the text line, then collecting a list
of lists. When done, build an array from the list.
It wastes a bit of cons space, but is potentially
a lot faster. Further improvements are also
possible, but at diminishing returns for the effort.
For example, I have a special read-line-into-reusable-string that is
noticeably faster, but only with large files.

(defun build-array (&optional (filename "e:\\data\\foo.bar"))
  (let ((list-of-lists nil)
        (len 0)
        arr)
    (with-open-file (f filename)
      (loop as line = (read-line f nil nil)
            while line
            as wordlist = (make-wordlist line)
            do
            (push wordlist list-of-lists)
            (incf len)))
    ;; you could eliminate the len var and just
    ;; take the length of the list instead.
    (setf arr (make-array len))
    (loop for wlist in list-of-lists
          for i downfrom (1- len)
          do
          (setf (aref arr i) wlist))
    arr))  ;; return the array to caller

(defun make-wordlist (text)
  (loop with pos = 0
        and list = nil
        do
        (multiple-value-bind (word npos)
            (read-from-string text nil nil :start pos)
          (unless word
            (return-from make-wordlist (nreverse list)))
          (push word list)
          (setf pos npos))))

Note: Tested on Allegro CL 3.0 on WinNT 3.51.


--
Pete Grant
Kalevi, Inc.
Software Engineering & development

Erik Naggum

Apr 21, 1996
[Bruce L. Lambert, Ph.D.]

| In the course of my information retrieval experiments, I often need to
| read large data files in from disk. I've got a couple of functions
| that do this successfully, but they seem awful damn slow to me. They
| also seem really convoluted for what should be a more straightforward
| task.

(this is not an answer to your question, but it could become one.)

the ANSI CL function `read-sequence' (`write-sequence') will replace
successive objects in a sequence (file) with objects in the file
(sequence). the efficiency gains hinge on equality of the element-type of
the stream and the sequence. `open' is not able to portably open a file
with anything but element-types that are subtypes of character, finite
subtypes of integer, signed-byte or unsigned-byte, barring an unusual
interpretation of :default.
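for contrast, the guaranteed-portable case already works: (unsigned-byte 8)
is always an acceptable element-type, so raw octets can be slurped in one
call (a sketch; `read-octets' is my own name for it):

```lisp
;; read a whole binary file into a fresh octet vector with a single
;; read-sequence call -- no per-object conversion from text.
(defun read-octets (filespec)
  (with-open-file (stream filespec :element-type '(unsigned-byte 8))
    (let ((buffer (make-array (file-length stream)
                              :element-type '(unsigned-byte 8))))
      (read-sequence buffer stream)
      buffer)))
```

the question is whether the same shape can be made to work for, say,
double-float.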

would it be a non-conforming extension to pass floating-point types as the
element-type to `open' so one could do the following?

(let* ((type 'double-float)
       (vector (make-array <dimensions>
                           :element-type type
                           :initial-element (coerce 0 type))))
  (with-open-file (stream <filespec>
                          :element-type type)
    (read-sequence vector stream))
  vector)

IMHO, if this is a non-conforming extension, we have a problem. if it is a
conforming extension, vendors should be encouraged to supply ways to allow
files to handle user-specified extended element-types. it would perhaps
make sense to achieve consensus on the interface to such extensions.

(should this be posted to comp.std.lisp? or is the group defunct?)

#<Erik>
--
errare umanum hest
