I need some help about how to best store data that resemble a database
table. I am interested in comments for both ACL and LWW.
I've got well-behaving records - say, each record is made up of 6
fixnums. On the surface, it would make sense to represent a record as a
class instance, a structure instance or a vector, and the table as a
vector or a hash-table.
Normally I would not care about the overhead, but there could
potentially be millions of such records, so it is important that the
size of a table does not exceed (* 6 4 rownum) bytes.
The problem I am having is that the overhead of instances or vectors is
higher than the number of bytes stored. This overhead serves no useful
purpose, because the table is homogeneous.
I could use a two-dimensional array of fixna or unsigned bytes, but then
I would loose the ability to use the built-in sort function - which is
not nearly as fast on unboxed values as boxed ones. I need to be able
to efficiently sort based on any of the 6 numbers.
I am interested in ways to minimize overhead when storing vectors or
instances in vectors or hash-tables. Alternatively, pointers to sorting
algorithms that are at least as efficient as the built-in sort function
(when it uses merge-sort), but work on arrays of numbers are
appreciated. In both cases, I am wondering how to measure space
consumption of objects, besides using time. (By the way, what's the
size of a cons cell in the above implementations?)
(One specialty here is that a possibly pre-sorted array is the object of
sorting. For example, the array is pre-sorted by columns A,B,C,D,E,F,
and the requirement is to sort by e.g., C,A,B,D,F,E or sort _and_ group
by C,B,F only. Again, I appreciate any pointers on relevant
algorithms.)
Thanks for your help,
Robert
I think that mergesort may be preferable because of the pre-ordering and
preferable worst-case scenario. I still wonder though how can I best
realize the benefits of pre-ordering -
for example, when there are two colums:
A 1
A 2
A 3
B 2
B 5
C 1
C 4
C 6
Here the target is to order by the 2nd column primarily, and the 1st
column secondarily. Merge-sort may give better performance if we use
the knowledge that we can 'cut' at the boundaries of A, B and C, and
merge from three sources.
Is there any published (or proprietary :-) algorithm to perform such a
specialised merge-sort? I can probably figure it out, but there may be
some worst-case scenarios - like merge-sorting from so many sources as
rows (if all letters are the same), and there may be multi-column
complications.
Thanks,
Robert
I was overly optimistic - I thought comparison operators would work on
fixnums or unsigned bytes without boxing (is there a reason why they
don't?). Because I cannot afford the 2x space and speed overhead of a
simple-array ("specialized" for t), it seems it leaves me no option.
(Also, boolean and bit functions either do or require consing.)
I saw a posting from Duane, in which he writes that unboxed comparison
may be difficult to do in the general case. I'd be happy with a
solution where I _need_ to declare that fixnums are used:
(defun sorti (array index column)
"Sorts index (fixnum vector) based on column number of fixnum array."
(declare (optimize (speed 3) (safety 0)))
(time (sort index #'(lambda (rownum1 rownum2)
(declare (type fixnum rownum1 rownum2))
(let ((foo (aref array rownum1 column))
(bar (aref array rownum2 column)))
(declare (type fixnum foo bar))
(< foo bar)))))
nil)
(sorti a i 1)
LWW says:
1.8 seconds used.
Standard Allocation 12324304 bytes.
Fixlen Allocation 0 bytes.
(Runtime goes up to 8(!) seconds if index is preordered, so it probably
does not even use a mergesort.)
Why it is difficult to compare two fixnums, if it's perfectly possible
to subtract them. I'd even settle with minusp - too bad that's boxing
too, and wants to use real instead of a fixnum!
Thanks for any suggestions in my challenge.
Robert
None of those box on fixnums. I tried your SORTI function in LWW 4.1,
and it didn't cons anything either. There must be something strange
about the way you initialize the arrays.
> I saw a posting from Duane, in which he writes that unboxed comparison
> may be difficult to do in the general case.
This is quite true. SORT takes a functional argument, after all. So
some advanced cross-function optimization would need to happen. But
fixnums are never boxed.
--
Pekka P. Pirinen
Adaptive Memory Management Group, Harlequin Limited
If it's spam, it's a scam. Don't do business with net abusers.