Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

When does ADJUST-ARRAY cons?

11 views
Skip to first unread message

Alain Picard

unread,
Jul 18, 2005, 12:12:25 AM7/18/05
to

I'm trying to read in a file, and do something to each
line. Speed is of the essence.

Here is our test file:
/tmp % wc test.big
348456 1742280 16868376 test.big


As a benchmark, I tried the obvious:

(time
(with-open-file (strm "/tmp/test.big")
(loop as line = (read-line strm nil nil)
while line
summing (length line))))

;; Timing the evaluation of (WITH-OPEN-FILE (STRM "/tmp/test.big") (LOOP AS LINE = (READ-LINE STRM NIL NIL) WHILE LINE SUMMING (LENGTH LINE)))
;; user time = 2.554
;; system time = 0.033
;; Elapsed time = 0:00:02
;; Allocation = 27367520 bytes standard / 7666439 bytes conses
;; 0 Page faults

Okay, next attempt was to reuse the buffer. I tried:

(defconstant +max-buflen+ 102400) ; make larger later, if that helps
(defparameter *buffer* (make-array +max-buflen+ :element-type 'standard-char))

(time
(with-open-file (strm "/tmp/test.big")
(loop as count = (read-sequence *buffer* strm)
while (plusp count)
summing count)))
Timing the evaluation of (WITH-OPEN-FILE (STRM "/tmp/test.big") (LOOP AS COUNT = (READ-SEQUENCE *BUFFER* STRM) WHILE (PLUSP COUNT) SUMMING COUNT))
user time = 0.074
system time = 0.034
Elapsed time = 0:00:01
Allocation = 26880 bytes standard / 4037 bytes conses
0 Page faults
16868376

Much better! Now all I need to do is to move along a second sequence
to the position of the next "logical" line, i.e. where the next
#\Newline is found.

e.g.

(defparameter *pos* 0)
(defparameter *line* nil)
(defconstant +base-line-length+ 120)
(defparameter *base-line* (make-array 0
:element-type 'standard-char
:adjustable t
:displaced-to *buffer*
:displaced-index-offset 0))

(defun load-line-2 (strm)
(flet ((reset ()
(setf *pos* 0)
(unless (plusp (read-sequence *buffer* strm))
(error 'end-of-file))))
(let ((pos (position #\Newline *base-line*)))
(when (or (null pos)
(zerop pos)
(> (incf *pos* (1+ pos)) +max-buflen+))
(reset))
(when (>= (+ +base-line-length+ *pos*)
+max-buflen+)
(reset))
(setq *line* (adjust-array *base-line* +base-line-length+
:element-type 'standard-char
:displaced-to *buffer*
:displaced-index-offset *pos*)))))

(time
(with-open-file (strm "/tmp/test.big")
(ignore-errors
(dotimes (i 1000000)
(load-line-2 strm)))))
Timing the evaluation of (WITH-OPEN-FILE (STRM "/tmp/test.big") (IGNORE-ERRORS (DOTIMES (I 1000000) (LOAD-LINE-2 STRM))))

user time = 5.163
system time = 0.034
Elapsed time = 0:00:06
Allocation = 25888 bytes standard / 11371987 bytes conses
0 Page faults

NIL
#<END-OF-FILE 206ED0D4>
CL-USER>

Ugh! Now, never minding that this function doesn't do what I want,
it does not bode well that
a) it's slower than read-line
b) it's still consing 11 Megs.

I'm obviously confused here. I thought this code would adjust
*base-line* to point to a subpart of *buffer*, moving it along
until it was time to re-fill the *buffer* from the stream.


Is what I'm trying to do possible (in an efficient manner) ?

Thanks,
--ap


Edi Weitz

unread,
Jul 18, 2005, 8:24:36 AM7/18/05
to
On Mon, 18 Jul 2005 14:12:25 +1000, Alain Picard <Alain....@memetrics.com> wrote:

> I'm obviously confused here. I thought this code would adjust
> *base-line* to point to a subpart of *buffer*, moving it along until
> it was time to re-fill the *buffer* from the stream.

(defconstant +max-buflen+ 102400)
(defparameter *buffer* (make-array +max-buflen+ :element-type 'base-char))

(defconstant +base-line-length+ 120)
(defparameter *base-line* (make-array 0

:element-type 'base-char


:adjustable t
:displaced-to *buffer*
:displaced-index-offset 0))

(defun test (&optional (length 1000))
(loop with max = (- +max-buflen+ length)
repeat 100000 do
(adjust-array *base-line* length
:element-type 'base-char
:displaced-to *buffer*
:displaced-index-offset (random max))))

Measuring TEST from above seems to suggest that calling ADJUST-ARRAY
in LispWorks always conses 22 bytes no matter what the value of LENGTH
is. I see similar constant factors for AllegroCL and CMUCL. I also
wonder why this has to be the case but three major implementations
obviously use a similar strategy to implement ADJUST-ARRAY.

So, it looks as if your code really "just" adjusts *BASE-LINE* but
this operation isn't for free. Actually, I think this technique will
only buy you something if your lines are /very/ long - for short lines
using a simple string as the buffer and applying SUBSEQ might be just
as fast.

> Is what I'm trying to do possible (in an efficient manner) ?

If I just compare the simple approach

(defun foo ()


(with-open-file (strm "/tmp/test.big")
(loop as line = (read-line strm nil nil)
while line
summing (length line))))

with the obvious Perl program

perl -e 'while (<>) { $i += length }; print $i' < /tmp/test.big

on a 64 MB text file with about 1.5 million lines I get around 1.2
seconds in Lisp[1] and 0.9 seconds in Perl (5.8.7). Given that Perl
has a /very/ good reputation as far as fast I/O is concerned I think
this is quite OK.

I wouldn't mind to see improvements, though... :)

Cheers,
Edi.

[1] Results on Linux with LispWorks 4.4.5. In my tests both CMUCL and
AllegroCL were perceptibly slower, though.

--

Lisp is not dead, it just smells funny.

Real email: (replace (subseq "spam...@agharta.de" 5) "edi")

Pascal Bourguignon

unread,
Jul 18, 2005, 10:00:29 AM7/18/05
to
Alain Picard <Alain....@memetrics.com> writes:

> I'm trying to read in a file, and do something to each
> line. Speed is of the essence.
>
> Here is our test file:
> /tmp % wc test.big
> 348456 1742280 16868376 test.big

> (defparameter *pos* 0)


> (defparameter *line* nil)
> (defconstant +base-line-length+ 120)
> (defparameter *base-line* (make-array 0
> :element-type 'standard-char

The only characters in standard-char are:

!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
plus #\space and #\newline.

Are you sure you only have one of those in your test.big file?
No tabulation? No accent?

I'd rather use base-char or even character to avoid any risk.

>
> user time = 5.163
> system time = 0.034
> Elapsed time = 0:00:06
> Allocation = 25888 bytes standard / 11371987 bytes conses
> 0 Page faults
>
> NIL
> #<END-OF-FILE 206ED0D4>

Perhaps you should choose a better implementation?

With my test.big:
333228 2688828 16868376 /tmp/test.big

in clisp (not even compiled):

[3]> (time


(with-open-file (strm "/tmp/test.big")
(ignore-errors
(dotimes (i 1000000)
(load-line-2 strm)))))

Real time: 1.323033 sec.
Run time: 0.5 sec.
Space: 182664 Bytes
GC: 1, GC time: 0.09 sec.
NIL ;
#<END-OF-FILE #x204C3446>


and compiled:

[6]> (time


(with-open-file (strm "/tmp/test.big")
(ignore-errors
(dotimes (i 1000000)
(load-line-2 strm)))))

Real time: 1.073208 sec.
Run time: 0.39 sec.
Space: 113352 Bytes
NIL ;
#<END-OF-FILE #x2056769E>

--
__Pascal_Bourguignon__ _ Software patents are endangering
() ASCII ribbon against html email (o_ the computer industry all around
/\ 1962:DO20I=1.100 //\ the world http://lpf.ai.mit.edu/
2001:my($f)=`fortune`; V_/ http://petition.eurolinux.org/

Edi Weitz

unread,
Jul 18, 2005, 10:23:57 AM7/18/05
to
On Mon, 18 Jul 2005 16:00:29 +0200, Pascal Bourguignon <p...@informatimago.com> wrote:

> Perhaps you should choose a better implementation?

Did it occur to you that these timings might be affected by other
factors like hardware or the underlying OS?

If you insist, on my machine CLISP (2.33.2) needs more than twice as
long as LispWorks for the same file. So, by your childish measure of
"better" LispWorks is twice as good as CLISP, right?

Cheers,
Edi.

edi@miles:/tmp$ wc test.big
1644906 6266610 64931058 test.big
edi@miles:/tmp$ cat foo.lisp


(defconstant +max-buflen+ 102400) ; make larger later, if that helps

(defparameter *buffer* (make-array +max-buflen+ :element-type 'base-char))

(defparameter *pos* 0)
(defparameter *line* nil)
(defconstant +base-line-length+ 120)
(defparameter *base-line* (make-array 0

:element-type 'base-char


:adjustable t
:displaced-to *buffer*
:displaced-index-offset 0))

(defun load-line-2 (strm)
(flet ((reset ()
(setf *pos* 0)
(unless (plusp (read-sequence *buffer* strm))
(error 'end-of-file))))
(let ((pos (position #\Newline *base-line*)))
(when (or (null pos)
(zerop pos)
(> (incf *pos* (1+ pos)) +max-buflen+))
(reset))
(when (>= (+ +base-line-length+ *pos*)
+max-buflen+)
(reset))
(setq *line* (adjust-array *base-line* +base-line-length+

:element-type 'base-char
:displaced-to *buffer*
:displaced-index-offset *pos*)))))

(defun foo ()


(with-open-file (strm "/tmp/test.big")
(ignore-errors
(dotimes (i 1000000)
(load-line-2 strm)))))

edi@miles:/tmp$ clisp -E "ISO-8859-1"
i i i i i i i ooooo o ooooooo ooooo ooooo
I I I I I I I 8 8 8 8 8 o 8 8
I \ `+' / I 8 8 8 8 8 8
\ `-+-' / 8 8 8 ooooo 8oooo
`-__|__-' 8 8 8 8 8
| 8 o 8 8 o 8 8
------+------ ooooo 8oooooo ooo8ooo ooooo 8

Copyright (c) Bruno Haible, Michael Stoll 1992, 1993
Copyright (c) Bruno Haible, Marcus Daniels 1994-1997
Copyright (c) Bruno Haible, Pierpaolo Bernardi, Sam Steingold 1998
Copyright (c) Bruno Haible, Sam Steingold 1999-2000
Copyright (c) Sam Steingold, Bruno Haible 2001-2004


[1]> (load (compile-file "foo.lisp"))

Compiling file /tmp/foo.lisp ...

Wrote file /tmp/foo.fas
0 errors, 0 warnings
;; Loading file /tmp/foo.fas ...
;; Loaded file /tmp/foo.fas
T
[2]> (time (foo))

Real time: 6.352963 sec.
Run time: 6.353035 sec.
Space: 23160048 Bytes
GC: 45, GC time: 0.046994 sec.
NIL ;
#<END-OF-FILE #x203FB2FE>
[3]> (quit)
Bye.
edi@miles:/tmp$ lw
LispWorks(R): The Common Lisp Programming Environment
Copyright (C) 1987-2005 LispWorks Ltd. All rights reserved.
Version 4.4.5
Saved by edi as lw-console, at 26 Apr 2005 11:21
User edi on miles
; Loading text file /usr/local/lib/LispWorks/lib/4-4-0-0/config/siteinit.lisp
; Loading text file /usr/local/lib/LispWorks/lib/4-4-0-0/private-patches/load.lisp
; Loading text file /home/edi/.lispworks

CL-USER 1 > (load (compile-file "foo.lisp"))
;;; Compiling file foo.lisp ...
;;; Safety = 3, Speed = 1, Space = 1, Float = 1, Interruptible = 0
;;; Compilation speed = 1, Debug = 2, Fixnum safety = 3
;;; Source level debugging is on
;;; Source file recording is on
;;; Cross referencing is on
; (TOP-LEVEL-FORM 1)
; (DEFCONSTANT +MAX-BUFLEN+)
; (DEFPARAMETER *BUFFER*)
; (DEFPARAMETER *POS*)
; (DEFPARAMETER *LINE*)
; (DEFCONSTANT +BASE-LINE-LENGTH+)
; (DEFPARAMETER *BASE-LINE*)
; LOAD-LINE-2
; FOO
; (TOP-LEVEL-FORM 2)
; Loading fasl file /tmp/foo.ufsl
#P"/tmp/foo.ufsl"

CL-USER 2 > (time (foo))
Timing the evaluation of (FOO)
; Loading fasl file /usr/local/lib/LispWorks/lib/4-4-0-0/load-on-demand/pcl/util/callcoun.ufsl

user time = 2.552
system time = 0.037
Elapsed time = 0:00:03
Allocation = 124448 bytes standard / 10413535 bytes conses
0 Page faults
NIL
#<END-OF-FILE 206EBF3C>

Pascal Bourguignon

unread,
Jul 18, 2005, 10:47:15 AM7/18/05
to
Edi Weitz <spam...@agharta.de> writes:

> On Mon, 18 Jul 2005 16:00:29 +0200, Pascal Bourguignon <p...@informatimago.com> wrote:
>
>> Perhaps you should choose a better implementation?
>
> Did it occur to you that these timings might be affected by other
> factors like hardware or the underlying OS?

We're not looking at the timings but at the garbage generated.

--
__Pascal Bourguignon__ http://www.informatimago.com/
Our enemies are innovative and resourceful, and so are we. They never
stop thinking about new ways to harm our country and our people, and
neither do we. -- Georges W. Bush

Edi Weitz

unread,
Jul 18, 2005, 11:01:43 AM7/18/05
to
On Mon, 18 Jul 2005 16:47:15 +0200, Pascal Bourguignon <p...@informatimago.com> wrote:

> Edi Weitz <spam...@agharta.de> writes:
>
>> On Mon, 18 Jul 2005 16:00:29 +0200, Pascal Bourguignon <p...@informatimago.com> wrote:
>>
>>> Perhaps you should choose a better implementation?
>>
>> Did it occur to you that these timings might be affected by other
>> factors like hardware or the underlying OS?
>
> We're not looking at the timings but at the garbage generated.

No. Alain said in his original message that "speed is of the
essence." My understanding is that the garbage that's generated isn't
interesting in itself but only as far as it affects the speed.

FWIW, in my tests CLISP on average spent 0.045 seconds doing GC while
LW used around 0.035 seconds. Go figure.

Cheers,
Edi.

PS: Just to make this clear: I don't care if CLISP needs a bit longer
for GC. I'm simply objecting against your "my implementation is
better than yours" games. It wasn't the first time you came up
with that and I think you're doing CLISP a disservice if you keep
running around telling people in c.l.l that CLISP is "better" than
other implementations because of this or that.

Kent M Pitman

unread,
Jul 18, 2005, 11:07:35 AM7/18/05
to
Edi Weitz <spam...@agharta.de> writes:

Well, CMUCL is open source, so I suppose you could come up with an alternate
way of doing it.

Did you contact the vendors to find out if the reason is "we want it that
way?" A lot of the time, it ends up being that vendors just do such tuning
as a need is discovered. Before you report to the world that your accidental
discovery is "cast in stone" you might want to talk to them. You'd expect
them to do the same before they reported something about the intent of your
code.

Personally, I've never seen this use of displaced arrays. That doesn't
make it a bad use, but points to the fact that people conceptualize
the use of operators in different ways. I've always seen them where
the displaced array was a "constant window" and pointing into
something someone else owns that is changing. Maybe they never
thought people would be dynamically adjusting the array in a tight
loop and so just never optimized it... You might think it was just a
matter of resetting some pointers, but maybe there's a bounds checker
or an extraneous arg checker or some other low-level thing that they
wanted to just hop through that was already present in the consing
entry point and they just grabbed that rather than custom-make an
alternative not knowing whether anyone cared... CL as a spec doesn't
say that you EVER have to avoid consing. That's assumed to get hammered
out by the free market, bug reports, etc.

Alain Picard

unread,
Jul 19, 2005, 3:34:58 AM7/19/05
to
Kent M Pitman <pit...@nhplace.com> writes:

> Personally, I've never seen this use of displaced arrays. That doesn't
> make it a bad use, but points to the fact that people conceptualize
> the use of operators in different ways.

But you don't see anything fundamentally flawed with the
usage I'm proposing? That's kind of what I'd like to know,
as I've never had to use ADJUST-ARRAY before, and I was just
looking for some sort of speed hack. So if an implementation
is free to make the particular usage I have in mind very cheap,
I could contact my vendor and ask them to extend the implementation
in this manner.

Alternately, I could really do with some sort of function
like READ-LINE-INTO-STRING, which would be the analogue of
READ-SEQUENCE, which would take a string with a fill-pointer
as argument, and read characters into the string, adjust the
fill pointer (and presumably signal some sort of error if
the newline, or whatever designated terminating character,
was not found soon enough). If this could result into
some speed improvement (because it could be made not to cons)
I'd be very happy.

Maybe this sort of problem is not as ubiquitous as I thought.

--
It would be difficult to construe Larry Wall, in article
this as a feature. <1995May29....@netlabs.com>

Alain Picard

unread,
Jul 19, 2005, 3:40:14 AM7/19/05
to
Pascal Bourguignon <p...@informatimago.com> writes:

> Perhaps you should choose a better implementation?

What a strange response! I ask a very specific question
about the allowed behaviour of a function in the standard,
and you suggest a "better" implementation?

In the real world, people have a lot of investment in an
implementation, both in code and business relationships.

The "better" implementation you seemed to suggest doesn't have support
for CORBA or multithreading. Neither does it, to my knowledge, come
with a support contract. Perhaps it's not "better" for my app after
all. Perhaps I wasn't clear, but my app _does_ do more than just read
lines and sum their length. :-)

--ap

Pascal Bourguignon

unread,
Jul 19, 2005, 5:46:18 AM7/19/05
to
Alain Picard <Alain....@memetrics.com> writes:

> Pascal Bourguignon <p...@informatimago.com> writes:
>
>> Perhaps you should choose a better implementation?
>
> What a strange response! I ask a very specific question
> about the allowed behaviour of a function in the standard,
> and you suggest a "better" implementation?

Perhaps we should call the Common Lisp Specification a Common Lisp
UNspecification. It takes great care not to specify a whole lot of
things. In particular, time and space complexities bounds are very
rarely specified. And specifically, NO bounds on the time and space
complexity of ADJUST-ARRAY are specified. So much for a very specific
answer about what the standard allows.


> In the real world, people have a lot of investment in an
> implementation, both in code and business relationships.
>
> The "better" implementation you seemed to suggest doesn't have support
> for CORBA or multithreading. Neither does it, to my knowledge, come
> with a support contract. Perhaps it's not "better" for my app after
> all. Perhaps I wasn't clear, but my app _does_ do more than just read
> lines and sum their length. :-)


If the clisp implementation doesn't please you, you can always write
your own implementation (of displaced vectors):


(shadow '(aref array-dimension #|etc|#))

(defclass displaced-vector ()
((size :reader displaced-vector-size
:initarg size
:type (integer 0))
(offset :reader displaced-vector-offset
:initarg offset
:type (integer 0))
(array :accessor displaced-vector-data
:initarg data
:type vector)))

(defmethod array-dimension ((self array) axis-number)
(cl:array-dimension self axis-number))

(defmethod array-dimension ((self displaced-vector) axis-number)
(unless (zerop axis-number)
(error "invalid axis-number"))
(displaced-vector-size self))


(defmethod aref ((self array) &rest indices)
(apply (function cl:aref) self indices))

(defmethod (setf aref) (value (self array) &rest indices)
(setf (apply (function cl:aref) self indices) value))


(defmethod aref ((self displaced-vector) &rest indices)
(when (cdr indices)
(error "Too many indices for a vector"))
(let ((index (car indices)))
(unless (< -1 index (displaced-vector-size self))
(error "Index out of range"))
(cl:aref (displaced-vector-data self)
(+ (displaced-vector-offset self) index))))

(defmethod (setf aref) (value (self displaced-vector) &rest indices)
(when (cdr indices)
(error "Too many indices for a vector"))
(let ((index (car indices)))
(unless (< -1 index (displaced-vector-size self))
(error "Index out of range"))
(setf (cl:aref (displaced-vector-data self)
(+ (displaced-vector-offset self) index)) value)))

(defmethod (setf displaced-vector-size) (new-size (self displaced-vector))
(unless (and (typep new-size (integer 0))
(< (+ new-size (displaced-index-offset self))
(array-dimension (dispalced-vector-data self) 0)))
(error "Invalid displaced-index-size"))
(setf (displaced-vector-offset self) new-size))


(defmethod (setf displaced-index-offset) (new-offset (self displaced-vector))
(unless (and (typep new-offset (integer 0))
(< (+ (displaced-vector-size self) new-offset)
(array-dimension (displaced-vector-data self) 0)))
(error "Invalid displaced-index-offset"))
(setf (displaced-vector-size self) new-offset))


(defun make-displaced-vector (size &key displaced-to displaced-index-offset)
(setf size (etypecase size
((integer 0) size)
(cons (if (and (null (cdr size))
(typep size '(integer 0)))
(car size)
(error "Invalid size for a vector")))))
(unless (<= size (array-dimension displaced-to 0))
(error "size for displaced vector too big"))
(unless (typep displaced-to 'vector)
(error "Invalid vector for displaced-to"))
(unless (and (typep displaced-index-offset (integer 0))
(< (+ size displaced-index-offset)
(array-dimension displaced-to 0)))
(error "Invalid displaced-index-offset"))
(make-instance 'displaced-vector
:size size :array displaced-to :offset displaced-index-offset))


;; That was an example for Joel Reymont but it'll do:

(defparameter data (make-array '(1000000)
:element-type 'float))
(dotimes (i 1000000) (setf (aref data i) (random 1.0)))

(defun backtest (window)
(assert (<= 0.0 (aref window 399) 1.0)))

(loop with window = (make-displaced-vector 400
:displaced-to data
:displaced-index-offset 0)
for n below (- (array-dimension data 0) (array-dimension window 0))
do (backtest window) (incf (displaced-index-offset window)))


--
__Pascal Bourguignon__ http://www.informatimago.com/

Small brave carnivores
Kill pine cones and mosquitoes
Fear vacuum cleaner

Alain Picard

unread,
Jul 19, 2005, 6:47:39 AM7/19/05
to
Pascal Bourguignon <p...@informatimago.com> writes:

> If the clisp implementation doesn't please you, you can always write
> your own implementation (of displaced vectors):

> [SNIP]

Thank you for your time and help. I loaded your code. After
correcting some typos (e.g.

(make-instance 'displaced-vector
:size size :array displaced-to :offset displaced-index-offset)

didn't work because there was no :array initarg on the DISPLACED-VECTOR
class, and (integer 0) did not appear to be a valid type
specification under LW), I finally succeeded in making a displaced-vector.

The first thing I tried was

(setq foo "this is a test string this is a test string")
"this is a test string this is a test string"

(setq dis (make-displaced-vector 10 :displaced-to foo :displaced-index-offset 5))
#<DISPLACED-VECTOR 22341874>

So far, so good. Now, use the darn thing:

(position #\s dis)
ERROR: #<DISPLACED-VECTOR 22341874> is not of type SEQUENCE.
[Condition of type TYPE-ERROR]

It looks like the cure is worse than the disease.
Can you see why this is not a useful solution? (Hint: I'd like
to use all of the existing Common Lisp sequence functions
on my vectors: position, find, remove, etc. etc.)

Please, before you respond, I think you need to take a step back and
think about the _real_ problem I'm trying to solve: manipulating input
ASCII lines as fast as possible.

cheers,
--ap


0 new messages