This may seem a petty concern, but I have recently worked on some algorithms
that very naturally fit in 32-bit integers (or 16 or 64) and they become so
much more /understandable/ in Common Lisp than in C, which I thought I could
still read and write fluently, but my C has evidently become rusty and I
find myself swearing at the computer when I try to write C again.
But I am no less frustrated by Common Lisp, whose performance is abysmal due
to bignum consomania when working with 32-bit numbers because the bignums
are immutable objects. I have in fact been sufficiently frustrated that I
have been thinking about what it would take to implement a Common Lisp that
used machine integers for integers and did not do any of those stupid
computer tricks to convert in and out of some other "integer" with spare
bits. I used to think that using spare bits was cool, that it was a stupid
waste to use byte-addressable hardware when everything you could possibly
want to address was never smaller than a 4-byte unit, but I had to concede
that I was only pining for the 36-bit processor days.
The usual way to implement fixnums is to use three "inband bits" to denote
integer (or two, using one bit for odd or even) and effectively reduce the
whole machine to one with 1/8th as many real objects as it had addressable
objects. That the hardware has 1/8th as much addressable capacity as it has
address lines is sheer insanity on the part of whoever decided that byte-
addressability was brilliant. It is particularly annoying since characters
should be 16 bits wide, too, and regardless of the character width, you
should use a byte pointer that would address bytes in a word. But there I
go pining for the PDP-10 again.
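As a concrete illustration of the scheme just described (a minimal sketch of
my own; the constants are illustrative, not from any particular
implementation):
;; Low-tag fixnum boxing/unboxing with three "inband bits"; the
;; fixnum tag of 0 is an assumption for illustration.
(defconstant +tag-bits+ 3)
(defconstant +fixnum-tag+ 0)
(defun box-fixnum (n)
  "Tag the integer N as an immediate fixnum word."
  (logior (ash n +tag-bits+) +fixnum-tag+))
(defun unbox-fixnum (word)
  "Recover the integer from a tagged fixnum WORD."
  (ash word (- +tag-bits+)))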
Given that we waste 3 bits on the address and they can be used for cool
things, the decision to reduce the integer value space similarly is a short
inference step away, but it is wrong. The right way is to use those three
bits for type tags (or you could go up to four), and to use an additional
"out-of-band bit" for the integer/other flag, carried elsewhere when a
register or other storage held "raw" 32-bit values. (Like the specialized
array type for 32-bit values.)
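A rough rendering of the out-of-band idea (my reading of the paragraph above,
nothing authoritative; the slot names are my invention):
;; The integer/other flag travels beside the word, not inside it, so
;; the word keeps all 32 bits of value.
(defstruct (oob-value (:constructor oob (integer-p word)))
  integer-p    ; the out-of-band integer/other flag
  word)        ; full-width payload: raw integer or tagged pointer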
Naturally, one would have to have boxed integers, too, perhaps even
interning for integers. This has been done before, and I recall Kent Pitman
explaining something about a design for integer-passing failing miserably,
so I can understand that those who know the past both regard it as folly and
reject the request to try their ingenuity at this problem.
However. I find that my functions either pass integers around a lot or they
are almost entirely integer-free. The same goes for floating-point numbers.
If I want maximal floating-point performance, I can get that with the full
machine-supported floating-point sizes, but if I can get by without maximal
performance, I can live with boxed floating-point numbers. I think the same
should apply to integers.
At least one other language (I dare not name it :) uses both immediate and
boxed integers, but blew it as far as dynamic typing goes. Common Lisp does
not need to make the same design mistake. Boxing and unboxing of integers
can be made very efficient at the hardware level, and functions that work
mostly with integer arithmetic should be able to store some information
useful to the garbage collector about which registers are integers and which
are pointers without noticeable cost. I have obviously not worked this out
fully, but I am sufficiently peeved by the abbreviated fixnum that I want to
find out if something like this is doable and what has already been tried
and considered to have been failures.
--
Erik Naggum, Oslo, Norway
Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.
I think I understand your problem. Would you not consider (*-byte 32)
declarations along with block compilation as a possible solution?
This is a solution to a slightly different problem from losing bits in
'fast, small' integer types, but for the kinds of things that frustrate
you an explicit 32 bit declaration should work OK with a good
compiler. CMUCL comes close to being that compiler and I suspect it
could be taken all the way there with some (non-structural) tuning. I
don't know about other compilers. Am I missing the point?
cheers,
BM
> However. I find that my functions either pass integers around a
> lot or they are almost entirely integer-free. The same goes for
> floating-point numbers. If I want maximal floating-point
> performance, I can get that with the full machine-supported
> floating-point sizes, but if I can get by without maximal
> performance, I can live with boxed floating-point numbers. I
> think the same should apply to integers.
Do you consider CMUCL's support for unboxed values of type
(unsigned-byte 32) (or (signed-byte 32) for that matter) to be a step
in the right direction?
I.e. inside functions (and across block-compiled or inlined
functions), with suitable type declarations, operations on
(unsigned-byte 32) work directly on machine word sized (for 32 bit
machines, obviously) entities.
E.g.:
(defun foo (x y)
  (declare (optimize (speed 3) (space 0) (safety 0) (debug 0))
           (type (unsigned-byte 32) x y))
  (logand (logior x y) (logxor x y)))
* (compile 'foo)
Compiling LAMBDA (X Y):
In: LAMBDA (X Y)
#'(LAMBDA (X Y)
(DECLARE (OPTIMIZE # # # #) (TYPE # X Y))
(BLOCK FOO (LOGAND # #)))
Note: Doing unsigned word to integer coercion (cost 20) to "<return value>".
Compiling Top-Level Form:
Compilation unit finished.
1 note
The x86 disassembly of the function follows below. After the
prologue has checked the number and types of arguments, and unboxed
them (the SAR ops), one can see that the logical operations (starting
at L3) work directly on the 32 bit values. Since we must box/tag the
return value for functions that follow the normal calling convention,
we test whether any of the upper 3 bits of the result are set (the
TEST ECX, 3758096384 operation, where the magic number is just
#xE0000000). If that's true, we create a boxed bignum representation
(L5 etc. [1]), otherwise we can just tag it as a fixnum with a
two-bit 0 tag (and a leading 0 sign bit, of course), using LEA EDX,
[ECX*4]. The code at L4 is the normal function epilogue.
While this function doesn't show it, CMUCL can handle unboxed values
on the stack[2], so even when intermediate or argument values are
spilt to the stack, they can remain unboxed.
Regs, Pierre.
481E0770: .ENTRY FOO() ; FUNCTION
788: POP DWORD PTR [EBP-8]
78B: LEA ESP, [EBP-32]
78E: MOV EAX, EDX
790: TEST AL, 3
792: JEQ L0
794: MOV ECX, [EAX-3]
797: JMP L1
799: L0: SAR EAX, 2
79C: MOV ECX, EAX
79E: L1: MOV EAX, EDI
7A0: TEST AL, 3
7A2: JEQ L2
7A4: MOV EAX, [EAX-3]
7A7: JMP L3
7A9: L2: SAR EAX, 2
7AC: L3: MOV EDX, ECX ; No-arg-parsing entry point
7AE: OR EDX, EAX
7B0: MOV EBX, ECX
7B2: XOR EBX, EAX
7B4: MOV ECX, EDX
7B6: AND ECX, EBX
7B8: TEST ECX, 3758096384
7BE: JNE L5
7C0: LEA EDX, [ECX*4]
7C7: L4: MOV ECX, [EBP-8]
7CA: MOV EAX, [EBP-4]
7CD: ADD ECX, 2
7D0: MOV ESP, EBP
7D2: MOV EBP, EAX
7D4: JMP ECX
7D6: NOP
7D7: NOP
7D8: L5: JNS L6
7DA: MOV EDX, 522
7DF: JMP L7
7E1: L6: MOV EDX, 266
7E6: L7: MOV BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
7ED: MOV BYTE PTR [#x280001BC], 4 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
7F4: MOV EAX, 16
7F9: ADD EAX, [#x8069724] ; current_region_free_pointer
7FF: CMP EAX, [#x80696F8] ; current_region_end_addr
805: JBE L8
807: CALL #x80536F8 ; alloc_overflow_eax
80C: L8: XCHG EAX, [#x8069724] ; current_region_free_pointer
812: MOV [EAX], EDX
814: LEA EDX, [EAX+7]
817: MOV [EDX-3], ECX
81A: MOV BYTE PTR [#x280001BC], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
821: CMP BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
828: JEQ L9
82A: BREAK 9 ; Pending interrupt trap
82C: L9: JMP L4
Footnotes for those interested in the nitty-gritty details:
[1] The code at L5 checks first whether the most-significant bit is
set. If it is, we have to alloc two words for the bignum, since a full
32 bits, plus the now-required sign bit, will not fit inside a single
machine word. If we only have 31 bits' worth of integer, we can fit it
and the leading sign bit inside a single machine word. Since we also
need a word for the header, we thus allocate either 2 or 4 words of
storage (the heap is 8-byte aligned).
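Footnote [1]'s size rule in executable form (my sketch, not CMUCL's code):
;; If bit 31 is set, value + sign bit need 33 bits, i.e. two data
;; words; with the header word and 8-byte heap alignment that makes
;; 2 or 4 words total on a 32-bit machine.
(defun bignum-data-words (u32)
  (if (logbitp 31 u32) 2 1))
(defun bignum-alloc-words (u32)
  (if (logbitp 31 u32) 4 2))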
[2] Either using separate unboxed stack(s), on platforms that
segregate unboxed and boxed values in stacks and the register file
(i.e. currently non-x86 platforms), or using a conservative GC on the
other platforms (i.e. x86 currently).
--
Pierre R. Mai <pm...@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
For the particular case where I work specifically with modulo-32-bit
integers, that would suffice, but where I need real integers, I do not want
C semantics, but I would really appreciate being able to use the full
machine width without going to bignums for the sake of the last 2 bits.
| Am I missing the point?
What I really want is better integration with the hardware and other tools
on the same hardware.
On the other hand, if we had boxed integers in the normal case, I could
imagine using interned integers and single-word boxes that could be upgraded
very efficiently to a bignum when whole machine words were used in the
computations. The assembler programmer in me still protests against the
shifting and reduced value space. So I also wonder what people have done,
not whether they are satisfied with the status quo, which I guess most
people are, at least as far as doing anything about it.
> I really wish fixnums were the natural machine word.
I agree with you.
> Naturally, one would have to have boxed integers, too, perhaps even
> interning for integers. This has been done before, and I recall Kent Pitman
> explaining something about a design for integer-passing failing miserably,
> so I can understand that those who know the past both regard it as folly and
> reject the request to try their ingenuity at this problem.
You might want to look at what Gwydion Dylan does. An object whose type
is known is stored (on a 32-bit machine) as the following C structure:
typedef struct descriptor {
    heapptr_t heapptr;
    union {
        long l;
        float f;
        void *ptr;
    } dataword;
} descriptor_t;
If the value stored is a fixnum ("<integer>") then the value is stored
in the .l field and a pointer to a fixnum proxy object is stored in the
.heapptr field. Characters are also stored in .l, but with .heapptr
pointing to a different proxy object. Objects stored on the heap simply
use the .heapptr field, ignoring the other half of the struct.
These descriptor structures are passed around by copying e.g. as
function arguments and result values.
Clearly this is wasteful of space, and to a lesser degree time, but:
> I find that my functions either pass integers around a lot or they
> are almost entirely integer-free. The same goes for floating-point numbers.
Exactly, and Gwydion takes advantage of this to reduce the storage usage
whenever possible. If the compiler knows that a value is definitely a
fixnum, or definitely a character, or definitely a single precision
floating point number, then it uses a simple machine word to store it.
Conversely, if the compiler knows that a given value is none of those
immediate types then it uses a simple 32 bit pointer to the object on
the heap.
> At least one other language (I dare not name it :) uses both immediate and
> boxed integers, but blew it as far as dynamic typing goes.
I need a coffee, just thinking about it... :-(
> Common Lisp does not need to make the same design mistake. Boxing
> and unboxing of integers can be made very efficient at the hardware
> level,
True. In GD there is no consing involved with boxing and unboxing of
fixnums.
> and functions that work mostly with integer arithmetic should be
> able to store some information useful to the garbage collector
> about which registers are integers and which are pointers without
> noticeable cost.
GD doesn't do this, but instead uses a conservative, non-copying GC
(Boehm). This decision might change if and when C-- becomes viable as a
compilation target.
> I have obviously not worked this out, fully, but I am sufficiently
> peeved by the abbreviated fixnum that I want to find out if
> something like this is doable and what has already been tried and
> considered to have been failures.
It seems to be reasonably successful in Gwydion Dylan. Unoptimized code
wastes a lot of space and a bit of time with these 8 byte descriptors,
but it's usually pretty easy to get rid of most of them in those parts
of the program where you actually care - in fact most of them disappear
without any real effort once you start declaring the types of things.
Types such as false-or(<integer>) are the main case where things stay as
descriptors. Type-unions of any combination of #t, #f, symbols, and
general heap structures are represented as a simple 32 bit pointer --
and that remains true just as long as no fixnum or character finds its
way into the type-union.
Performance overall certainly doesn't seem to suffer noticeably compared
to, say, CMUCL or O'Caml.
-- Bruce
> * Bulent Murtezaoglu <b...@acm.org>
> | Would you not consider (*-byte 32) declarations along with block compilation
> | as a possible solution?
>
> For the particular case where I work specifically with modulo-32-bit
> integers, that would suffice, but where I need real integers, I do not want
> C semantics, but I would really appreciate being able to use the full
> machine width without going to bignums for the sake of the last 2 bits.
>
> | Am I missing the point?
>
> What I really want is better integration with the hardware and other tools
> on the same hardware.
>
> On the other hand, if we had boxed integers in the normal case, I could
> imagine using interned integers and single-word boxes that could be upgraded
> very efficiently to a bignum when whole machine words were used in the
> computations. The assembler programmer in me still protests against the
> shifting and reduced value space. So I also wonder what people have done,
> not whether they are satisfied with the status quo, which I guess most
> people are, at least as far as doing anything about it.
Several years ago, I went through a similar thought process, and it led me
to experiment on what I currently consider to be a dead-end. It is
undocumented, and looking at it again I can see a bug, but you might want
to experiment with it in Allegro CL, where it still exists.
Here is the thought process:
- Allegro CL already compiles "unboxed-integers" for any declared range
of signed integer that is larger than the fixnum range but within the
machine-integer range. Such calculations are done by first "unboxing"
the integers (either by shifting if they are integers or by extracting
bigit bits if they are bignums), and then by performing calculations
until the machine-integer range is exceeded or until a tagged lispval
is required. There are two major problems with this:
1. Boxing is required often, including when function calls are made.
We mitigate this problem with an undocumented but relatively well-known
technique called "immediate-args" (I described this at a LUGM tutorial
in 1998). This allows lisp functions and calls to be configured to
allow non-lisp arguments to be passed between them without requiring
boxing. Of course, this is not what you want, since these values are
unboxed and thus untyped (i.e. unsafe, non-dynamic, etc.)
2. The result of an expression compiled in unboxed manner must always
be boxed.
The traditional method of mitigating this second problem is by allocating
a specialized array of machine-integer (either (signed-byte 32) or
(signed-byte 64), depending on lisp width). So if you did something like:
CL-USER(9): (setq *x* (make-array 1 :element-type '(signed-byte 32)))
#(0)
CL-USER(10): (setf (aref *x* 0) (1+ most-positive-fixnum))
536870912
CL-USER(11): *x*
#(536870912)
CL-USER(12):
Note first that if the (setf aref) had been compiled within a function,
and no result were returned, and optimization settings were conducive,
then the setf would not cons, because the "box" has already been created.
_However_, it is highly inconvenient that that "box" cannot be treated as
if it were a number, because in order to access the value, you must use
aref (as indeed you must use (setf aref) in order to store into the "box").
So I invented three "box" types, for integer, single-float, and double-float,
using a creating function and a setfable accessor. Create-box takes a type
and creates what looks like a number of that type, and (setf box-value)
stores a value into the box. The box-value accessor seems as if it does
nothing, but in fact it takes the value out of the box and puts it where
you ask it to (either using the value in the next unboxed-calculation, or
boxing it up into a new immutable number with that value).
Here is an example usage:
CL-USER(1): (defvar *x* (excl::create-box 'integer))
*X*
CL-USER(2): (defvar *y* (1+ most-positive-fixnum))
*Y*
CL-USER(3): (describe *x*)
536870912 is a BIGNUM.
It is a writable box.
It is 30 bits wide
CL-USER(4): (describe *y*)
536870912 is a BIGNUM.
It is 30 bits wide
CL-USER(5): (setf (excl::box-value *x*) #x7fffffff)
2147483647
CL-USER(6): (setf (excl::box-value *y*) #x7fffffff)
Error: 536870912 is not a machine-integer box
[condition type: SIMPLE-ERROR]
Restart actions (select using :continue):
0: Return to Top Level (an "abort" restart).
1: Abort entirely from this process.
[1] CL-USER(7): :res
CL-USER(8): *x*
2147483647
CL-USER(9):
Note that a "box" is really a mutable number (an integer in this case).
You can create single-float and double-float boxes as well. These
accessors compile in unboxed fashion, with
(setf (box-value x) <expr-that-can-be-compiled-unboxed>)
not requiring any consing, and the result of the setf being a
fully boxed (but not consed, because it had already been preallocated)
value.
And now, for the caveats, and why I highly recommend against using this
feature for any real calculations:
1. The integer box is buggy - I set the box-value to a negative number,
and it destroyed the writability of the box (box-ness and the sign are
in the same byte and apparently interact).
2. If you make a lot of these boxes for the purposes of reducing consing,
then what you will have in your lisp is a bunch of what looks like
numbers, but whose values are not inviolate. Thus, unless you strictly
manage the boxes you create, multiple calculations could have surprising
results (talk about your obfuscated lisp-code! :-)
3. Single-float and double-float boxes do not have this problem:
Since Allegro CL assumes that all integers have been normalized (i.e.
there are never any bignums that are of less than fixnum magnitude)
and since calculations might break if these assumptions are violated,
a (setf box-value) functionality will never actually store a value
whose magnitude is a fixnum into a box. Instead, it simply returns
the shifted fixnum value. Note that normally (setf box-value) returns
not the value stored, but the box that was stored into (or the shifted
fixnum, in this special case). So if you are expecting the value of
*x* to change, it won't do so - it is sort of like a destructive
operation: I would recommend the following paradigm:
(setq *x* (setf (excl::box-value *x*) <result-that-might-be-fixnum-size>))
but of course if the setf resulted in a fixnum, then *x* would no
longer be a box...
It's been years since I looked at this. It was interesting reminiscing
about an old experiment.
--
Duane Rettig du...@franz.com Franz Inc. http://www.franz.com/
555 12th St., Suite 1450 http://www.555citycenter.com/
Oakland, Ca. 94607 Phone: (510) 452-2000; Fax: (510) 452-0182
> I really wish fixnums were the natural machine word. [..]
Whenever I think about this problem, I tend to conclude that the best
solution would be to have a sub-language for integer calculations that
is a better fit to typical CPUs than native CL, making it easier for
the compiler to do the right thing, and maybe even yield code that is
easier to read and write for programmers thinking at the assembly
level. In combination with some means to pass unboxed integers between
functions, I think this is a solution that could work, albeit not as
elegant as having pervasive word-wide fixnums.
This integer sub-language would perhaps have a default
(i.e. CL-implementation oblivious) compiler/macro-expander that
represents each integer with two lisp values, or use some other such
trick to avoid consing.
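For instance (a minimal sketch of mine, not Frode's design), a 32-bit add in
such a sub-language could expand into fixnum-only code along these lines:
;; Each 32-bit quantity travels as two 16-bit fixnum halves, so no
;; intermediate result ever leaves fixnum range.
(defun u32+ (ahi alo bhi blo)
  "Add two 32-bit values given as high/low halves; wraps modulo 2^32."
  (let* ((lo (+ alo blo))
         (hi (+ ahi bhi (ash lo -16))))   ; propagate the carry
    (values (logand hi #xFFFF) (logand lo #xFFFF))))
;; (u32+ #xFFFF #xFFFF #x0000 #x0001) => 0, 0   ; #xFFFFFFFF + 1 wraps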
--
Frode Vatvedt Fjeld
I want my Lisp Machine.
> But there I
> go pining for the PDP-10 again.
That'd be great, too.
PDP-10 Maclisp and Franz Lisp used bibop rather than tag bits to type
pointers. They shifted down the address to get a page number which was
used as an index into an array of type values.
This meant that pointers as well as fixnums could be passed straight
to foreign code or the operating system.
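In sketch form (mine; the page size and address width are assumptions for
illustration, not Maclisp's actual parameters):
;; Bibop type lookup: shift the address down to a page number and
;; index a table of type codes.
(defconstant +page-shift+ 9)                 ; assume 512-word pages
(defvar *page-type-table*
  (make-array (ash 1 (- 18 +page-shift+))    ; assume an 18-bit address space
              :initial-element :free))
(defun pointer-type (addr)
  (aref *page-type-table* (ash addr (- +page-shift+))))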
On a modern CPU with huge penalties for cache misses, tags make more
sense as they will always be in the cache along with the pointer that
they type. Needing to load a cacheline for the pointer as well as one
for part of the type table would really reduce cache efficiency.
I have always wished that more effort had been made to try adding
hardware features to speed up bibop. All of the lispms and the RISC
chips that I know of (SOAR, SPUR and SPARC) used tags.
I started building an add-on board to the AtariST to do hardware bibop
in the 80's, but I got a faster 386 with paged VM soon after and never
finished it. The 68000 was an ideal CPU for this as it had three
"function code" pins that would indicate whether a bus cycle was for
instruction fetch or data read.
In a modern CPU there are usually several spare bits in a page table
entry. I would like to see some way of using them to store the type
of objects in that page.
It would be nice to have a user mode instruction that would treat the
contents of a source register as a pointer and load the type bits for
its page table entry into a destination register.
Robert Swindells
Me too.
Although it's not the same thing, I presume that 64 bit lisps do (or
could) have fixnums which are significantly bigger than 32 bits, so
the various `naturally 32 bit' algorithms like lots of crypto stuff
should be quite easy to implement without horror consing. Decent
floats will still be a pain though.
Of course that assumes pervasive 64 bit HW, which seems to be very
slow arriving in the PC world.
--tim
When I implemented these things myself, I used fairly large pages (64K) and
held the type information in the first few words of the page. To access the
type, just zero the lowest 16 bits and read the type information. The cache
line would be fairly frequently accessed and would probably remain in L1.
When allocating values from such a type-homogeneous page, the first few words
of the page would also contain the next unused unit. Since this word is in
the same cache line as the type information, it would also ensure that that
cache line would remain in L1. Which page to allocate from would be found
from a table of "open pages" for each type. Garbage collection knew the
layout of the types on a page from other administrative information on the
page and copying out those values that were still referenced was fairly
simple. Mark bits were implemented efficiently, with one bit per allocated
unit, stored in dedicated words. I used this scheme mainly to take care
of C programs that allocated a huge number of small pieces of memory of a
small set of sizes and noticed that the standard allocator used up a lot of
memory for administrative purposes and got really wasteful. Since it was
both expensive to allocate 64K-pages and wasteful to move things around too
much, I could exclude individual pages from the garbage collection with a
flag in the page that would cause it not to be copied. I implemented this
on the first Unix machine I had access to, an Altos running on M68K that had
4M of memory. Because of this new allocation scheme, the program was able
to solve problems three times as large as it could with normal malloc and
free. We actually found that some types were used only for slowly accreting
long-term storage that had previously caused much fragmentation of memory,
and that there was more harm than good in freeing these objects; this
caused the whole new allocator design.
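The address arithmetic of that scheme, in sketch form (my rendering; the
exact header layout is assumed):
;; Any object's 64K page header is found by masking the low 16 bits;
;; the type word and next-free word share the header's cache line.
(defun page-header (addr)
  (logandc2 addr #xFFFF))                  ; addr & ~0xFFFF
(defun object-type (addr memory)           ; MEMORY: a word array
  (aref memory (page-header addr)))        ; word 0: the page's type
(defun next-free (addr memory)
  (aref memory (+ (page-header addr) 1)))  ; word 1: next unused unit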
| On a modern CPU with huge penalties for cache misses, tags make more sense
| as they will always be in the cache along with the pointer that they type.
| Needing to load a cacheline for the pointer as well as one for part of the
| type table would really reduce cache efficiency.
Not necessarily. The above scheme, now at least 20 years old, would scale
well in modern processors. I have not tried to reimplement it. Perhaps I
should... I "invented" it on the spot, but I have no idea whether this
scheme has been "independently invented" in the interim. I have seen
various schemes for allocating memory in particular sizes, but a lot of them
have just used the (IMHO stupid) word before the object to make a linked
list of freed objects, wasting much memory with alignment requirements.
| The 68000 was an ideal CPU for this as it had three "function code" pins
| that would indicate whether a bus cycle was for instruction fetch or data
| read.
Never did hardware stuff on the M68K, but I really liked its instruction
set. Such a pity that Intel "won" with its braindamaged design.
| In a modern CPU there are usually several spare bits in a page table entry.
| I would like to see some way of using them to store the type of objects in
| that page.
In my experience, you may need quite a bit of administrative overhead at the
beginning of the page, and the usual page size of 4K is way small to pay for
the overhead. Or so I thought when I used 64K pages. It was also sexy in
that the &~0xffff operation turned into smarter instructions in the compiler
back then. Ah, the calm breeze on these walks down memory lane (several
puns intended).
I hear you; I'm not sure I agree.
In order for fixnums to be the natural machine word, the values either
"bloat" the size of "fixnum objects," or require some other form of
boxing.
The situation seems reminiscent of what's in Java; Java _isn't_ an
"objects-all-the-way-down" language in that you often have to choose
between the (not-objects) "int" type and the (yes, it's an object)
Integer type. 32 bit ints on 32 bit architectures are liable, once
aligned, to wind up consuming 64 bits, thus wasting lots of room.
What I would prefer to see is for 64 bit machines to get more popular,
so that the basic "fixed integer" type could have quite a few tag bits
stolen off the top end without injuring the dynamic range in any way
that we'd care about.
Of course, the 36 bit fans doubtless have some useful commentary
too... I'm not sure that they normally had 36 bit long integers; I
think they had the _expectation_ of some bits getting eaten by tags of
one sort or another...
--
(concatenate 'string "aa454" "@freenet.carleton.ca")
http://www3.sympatico.ca/cbbrowne/spiritual.html
Rules of the Evil Overlord #18. "I will not have a son. Although his
laughably under-planned attempt to usurp power would easily fail, it
would provide a fatal distraction at a crucial point in time."
<http://www.eviloverlord.com/>
That is indeed the problem that I am trying to "even out". The cost of an
integer in the range between fixnum max and machine integer max today is
prohibitive and we end up working with integers that have unpredictable
costs. I would accept a little higher memory cost for medium-sized integers
in order to reduce the overall cost for 32-bit integers. (It would of course
be economical to intern the smallest integer range.)
| The situation seems reminiscent of what's in Java [...]
I already covered this. It is not unless you miss the point.
> * Robert Swindells
> | PDP-10 Maclisp and Franz Lisp used bibop rather than tag bits to type
> | pointers. They shifted down the address to get a page number which was used
> | as an index into an array of type values.
[... description of non-immediate integer allocation technique]
This is probably the best way to do this, if one is willing to deal
with the additional overhead of memory accesses for non-immediates
vs immediates. I think I saw you state elsewhere that you are willing
to accept such overhead.
> | On a modern CPU with huge penalties for cache misses, tags make more sense
> | as they will always be in the cache along with the pointer that they type.
> | Needing to load a cacheline for the pointer as well as one for part of the
> | type table would really reduce cache efficiency.
>
> Not necessarily. The above scheme, now at least 20 years old, would scale
> well in modern processors. I have not tried to reimplement it. Perhaps I
> should... I "invented" it on the spot, but I have no idea whether this
> scheme has been "independently invented" in the interim. I have seen
> various schemes for allocating memory in particular sizes, but a lot of them
> have just used the (IMHO stupid) word before the object to make a linked
> list of freed objects, wasting much memory with alignment requirements.
Correct. However, the overhead of not treating integers as immediate must
be a design consideration (i.e. in both Franz Lisp and most CLs, an addition
of two fixnums simply involves adding the bits and optionally testing for
overflow, whereas if I understand your scheme correctly, it involves actually
accessing the memory for the integer values). These memory accesses may be
entirely acceptable.
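To make the contrast concrete (my sketch, assuming a zero low tag as earlier
in the thread): the tags add to zero, so immediate fixnum addition is a
single machine ADD on the tagged words, with no memory access at all:
;; (ash a 3) + (ash b 3) = (ash (+ a b) 3), so the tagged sum is the
;; sum of the tagged words; only overflow needs checking.
(defun tagged-fixnum-+ (wa wb)
  (let ((sum (+ wa wb)))                    ; one ADD in machine terms
    (if (typep sum '(signed-byte 32))       ; stands in for the overflow flag
        sum
        (error "fixnum overflow: box a bignum"))))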
> | The 68000 was an ideal CPU for this as it had three "function code" pins
> | that would indicate whether a bus cycle was for instruction fetch or data
> | read.
>
> Never did hardware stuff on the M68K, but I really liked its instruction
> set. Such a pity that Intel "won" with its braindamaged design.
Hear, hear.
> | In a modern CPU there are usually several spare bits in a page table entry.
> | I would like to see some way of using them to store the type of objects in
> | that page.
>
> In my experience, you may need quite a bit of administrative overhead at the
> beginning of the page, and the usual page size of 4K is way small to pay for
> the overhead. Or so I thought when I used 64K pages. It was also sexy in
> that the &~0xffff operation turned into smarter instructions in the compiler
> back then. Ah, the calm breeze on these walks down memory lane (several
> puns intended).
It should still be possible, given extra bits (where "extra" means any
number of bits more than the natural (32 or 64) word size per word), if these
extra bits are user-accessible in some way, to come up with a very low
overhead mechanism for dealing with this problematic integer range.
> * Erik Naggum wrote:
> > I really wish fixnums were the natural machine word.
>
> Me too.
>
> Although it's not the same thing, I presume that 64 bit lisps do (or
> could) have fixnums which are significantly bigger than 32 bits, so
> the various `naturally 32 bit' algorithms like lots of crypto stuff
> should be quite easy to implement without horror consing. Decent
> floats will still be a pain though.
But what happens then when the "naturally 32 bit" algorithms start
becoming "naturally 64 bit"?
> Erik Naggum <er...@naggum.no> wrote in message news:<32387790...@naggum.no>...
> > I really wish fixnums were the natural machine word.
>
> PDP-10 Maclisp and Franz Lisp used bibop rather than tag bits to type
> pointers. They shifted down the address to get a page number which was
> used as an index into an array of type values.
>
> This meant that pointers as well as fixnums could be passed straight
> to foreign code or the operating system.
Well, yes, but fixnums in FranzLisp contained even fewer bits than on
more modern CL implementations (I think that the range was -1023 to
1022, or something like that). In fact, a section of memory was
explicitly allocated for fixnums, so that no other lisp objects
could be allocated at that location (so that the gc and type-of could
not become confused as to whether or not a bit pattern was a fixnum
or a lisp object of some other type). And since fixnums have magnitude
and order, it was not just a case of allocating new pages of fixnums
when needed.
The implication of this is that if you cover all of the 32-bit space
with fixnums, there is no longer any room for anything else (neither
other lisp objects nor even program space).
> On a modern CPU with huge penalties for cache misses, tags make more
> sense as they will always be in the cache along with the pointer that
> they type. Needing to load a cacheline for the pointer as well as one
> for part of the type table would really reduce cache efficiency.
The modern CLs do not use memory accesses for fixnums; they are
immediates. Therefore going back to bibop (in hashed-address form)
for fixnums carries with it an automatic penalty with it, no matter
how fast your cache.
> I have always wished that more effort had been made to try adding
> hardware features to speed up bibop. All of the lispms and the RISC
> chips that I know of (SOAR, SPUR and SPARC) used tags.
>
> I started building an add-on board to the AtariST to do hardware bibop
> in the 80's, but I got a faster 386 with paged VM soon after and never
> finished it. The 68000 was an ideal CPU for this as it had three
> "function code" pins that would indicate whether a bus cycle was for
> instruction fetch or data read.
The entire reason for modern lisps moving away from special-purpose
hardware is that such hardware cannot compete in the long term.
There is no way for a LispM company to devote close to the same
order of magnitude of manpower to keep the performance curve up with
GP hardware. However... if you can use existing GP hardware to
advantage:
> In a modern CPU there are usually several spare bits in a page table
> entry. I would like to see some way of using them to store the type
> of objects in that page.
This is a good idea, as long as
1. The PTE bits truly are spare (and are likely to stay that way),
2a. Your operating system allows you to use these bits, or
2b. You are willing to write your own LispOS.
> It would be nice to have a user mode instruction that would treat the
> contents of a source register as a pointer and load the type bits for
> it's page table entry into a destination register.
Most architectures do indeed have such instructions, but some of them
regard such instructions as supervisor-only, and thus only accessible
by a LispOS kernel, or by explicit kernel support in whatever Unix
kernel you are operating under.
Duane Rettig <du...@franz.com> writes:
> But what happens then when the "naturally 32 bit" algorithms start
> becoming "naturally 64 bit"?
Which they will --- Bruce Schneier states that making blowfish run
fast on "typical 32 bit hardware"(*) was an _explicit_ design goal.
So we'll be back to square one.
For the record, the CMUCL hacks (to unbox 32-bit modulo adds and xors)
make blowfish run 120 times faster than the bignum calls in LispWorks 4.2.
Even then, I'm surely missing something basic, because Schneier says
blowfish should run at 8 MB/s on a 150 MHz Pentium (18 clock
cycles/byte), and mine runs at 3 MB/s on a 433 MHz Celeron.
I've read that in Lisp you can write slow code fast, and then
have to work for the speed. I didn't realize just _how_ _hard_,
however. :-)
(*) Interestingly, Schneier conjectures that blowfish is _probably_ strong
using 16 bit blocks (which _would_ fit into fixnums of essentially
all current lisps). I ran out of enthusiasm and didn't write that
code to benchmark it.
> But what happens then when the "naturally 32 bit" algorithms start
> becoming "naturally 64 bit"?
Well, yes, and floats are already hard. I suppose my only hope would
be that a lot of stuff (like crypto) really *doesn't* need 64 bit
machines, and so 32 bit algorithms will remain common. But I can
imagine someone saying exactly the same thing about 16 bit algorithms.
--t
> Well, yes, and floats are already hard. I suppose my only hope would
> be that a lot of stuff (like crypto) really *doesn't* need 64 bit
> machines, and so 32 bit algorithms will remain common.
They don't _need_ 64 bit machines or 32 bit machines, they're just
designed for this, because it is common. That is, this goes for stuff
like Blowfish, which (IIRC) is mainly meant to be implemented in
software, whereas algorithms like Rijndael should also work well even
on 8 bit machines. (I think the idea is to have an algorithm that
they can use on smartcards.)
--
johs
> [..] a technique called "immediate-args" (I described this at a LUGM
> tutorial in 1998). This allows lisp functions and calls to be
> configured to allow non-lisp arguments to be passed between them
> without requiring boxing. Of course, this is not what you want,
> since these values are unboxed and thus untyped (i.e. unsafe,
> non-dynamic, etc.) [...]
Is this tutorial or some documentation of this available somewhere?
--
Frode Vatvedt Fjeld
Well, yes, but it's just a powerpoint slide presentation. There are
some comments, but I had gotten frustrated with MS software not
allowing me to add as many comments as I wanted, without trashing
pointers somewhere :-( I salvaged as many comments as possible
without trashing any of the slides, but there are many holes in
the comments.
ftp://ftp.franz.com/pri/duane/LUGM-tut.ppt
It's about 850Kb.
> I recently had the fun (!) of discovering how hard it can
> be to really optimize CL code while implementing BLOWFISH.
> What an eye opener!
>
> Duane Rettig <du...@franz.com> writes:
>
> > But what happens then when the "naturally 32 bit" algorithms start
> > becoming "naturally 64 bit"?
I should have placed a smiley here. Encryption algorithms
are _already_ greater than 32 bits naturally. Some are 128 bits.
> Which they will --- Bruce Schneier states that making blowfish run
> fast on "typical 32 bit hardware"(*) was an _explicit_ design goal.
> So we'll be back to square one.
Blowfish may have been designed to fit into 32-bit hardware (C mentality)
but it is a 64-bit algorithm. The 32-bit-ness comes from breaking up
each block into 32 bits at a time.
> For the record, the CMUCL hacks (to unbox 32-bit modulo adds and xors)
> make blowfish run 120 times faster than the bignum calls in LispWorks 4.2.
There's the carrot ...
> Even then, I'm surely missing something basic, because Schneier says
> blowfish should run at 8 MB/s on a 150 MHz Pentium (18 clock
> cycles/byte), and mine runs at 3 MB/s on a 433 MHz Celeron.
>
> I've read that in Lisp you can write slow code fast, and then
> have to work for the speed. I didn't realize just _how_ _hard_,
> however. :-)
And there's the work to get the carrot. If the carrot is very tasty,
then one works for it.
However, I wonder just how much work it really is, compared to the
same algorithm in C. Bear with me here, because the answer might not
be as intuitive as one might think:
I've seen Lisp/C coders (people who know both languages) bend over
backwards to make C code work (and of course, work fairly fast). When
asked how hard it was, they'll say "it was no big deal". But then those
same people will complain vociferously at all of the terrible extra work
required to optimize some already-working lisp code (it worked right
away, of course) even if that extra effort consisted of placing a single
declaration in the code.
So it would be interesting to figure out (probably necessarily by
measurement of actual time spent rather than by estimation based on
intuition) how much time it took you to do this optimization _as_
_opposed_ to implementing the same algorithm in C. Bear in mind
also that Lispers aren't used to having to make such optimizations
(they usually simply aren't necessary), and so one would have to
mitigate the times based on being an expert C coder vs a novice
fast-lisp coder. All things considered and factored in, I'd suspect
that the times are pretty similar, if not _still_ tipping the scales
in Lisp's favor, even with all of the warts of trying to shoehorn a
64-bit (or 128-bit) algorithm into 32 bits using the same shoehorn
as C is using...
> (*) Interestingly, Schneier conjectures that blowfish is _probably_ strong
> using 16 bit blocks (which _would_ fit into fixnums of essentially
> all current lisps). I ran out of enthusiasm and didn't write that
> code to benchmark it.
But if the 32-bit technique can be used, why dumb down to 16 bits?
Or, why not use 29 bits instead? Three fixnums per block might work...
Yes, however it is important to be able to implement commercially
important algorithms efficiently, and saying that it's hard because
types that the implementation efficiently supports have a `funny' (to
the designers of these algorithms) size is not a good answer.
--tim
> Alain Picard <apicard+die...@optushome.com.au> writes:
>
> Blowfish may have been designed to fit into 32-bit hardware (C mentality)
> but it is a 64-bit algorithm. The 32-bit-ness comes from breaking up
> each block into 32-bits at a time.
I guess that's a matter of definition; yes, it encrypts 64 bit blocks,
but was explicitly designed to do so while holding two 32-bit machine words,
with the knowledge that addition and xoring those two numbers would be
native processor instructions. In my mind this makes it a "32-bit"
algorithm; I hope you can see my POV.
> However, I wonder just how much work it really is, compared to the
> same algorithm in C. Bear with me here, because the answer might not
> be as intuitive as one might think:
>
> I've seen Lisp/C coders (people who know both languages) bend over
> backwards to make C code work (and of course, work fairly fast). When
> asked how hard it was, they'll say "it was no big deal". But then those
> same people will complain vociferously at all of the terrible extra work
> required to optimize some already-working lisp code (it worked right
> away, of course) even if that extra effort consisted of placing a single
> declaration in the code.
Well, I'm still in the minor leagues where Lisp is concerned; this was
the first time I really tried to strangle performance out of it. And
my experience should not be viewed as negative or as a complaint; as you
rightly put it, I _did_ get to a _working_ algorithm extremely fast.
> So it would be interesting to figure out (probably necessarily by
> measurement of actual time spent rather than by estimation based on
> intuition) how much time it took you to do this optimization _as_
> _opposed_ to implementing the same algorithm in C. Bear in mind
> also that Lispers aren't used to having to make such optimizations
> (they usually simply aren't necessary), and so one would have to
> mitigate the times based on being an expert C coder vs a novice
> fast-lisp coder. All things considered and factored in, I'd suspect
> that the times are pretty similar, if not _still_ tipping the scales
> in Lisp's favor, even with all of the warts of trying to shoehorn a
> 64-bit (or 128-bit) algorithm into 32 bits using the same shoehorn
> as C is using...
I hear you, but I think this is a pathological example; heck, the
algorithm design all but _tells_ you just how to smash the bits
together with the C operators to perform the blowfish rounds. I don't
think a (non-naïve) C coder _can_ really implement this so that it will
run slowly (say, as slowly as Lisp which is consing bignums).
On so many other problems, however, I feel that Lisp allows me to
program the solution so much more _cleanly_ and _obviously_ that
it's bound to be faster; and I'm getting much better at getting a
feel for what operations in Lisp are expensive.
> But if the 32-bit technique can be used, why dumb down to 16 bits?
> Or, why not use 29 bits instead? Three fixnums per block might work...
Actually, going down to 16 bits isn't "dumbing" down the algorithm
(as I said, Schneier conjectures the algorithm is still "strong"),
and as for using 16 instead of 29 bits, the algorithm is just a lot
simpler for block sizes which are powers of two.
Don't get me wrong; I _love_ lisp. It's just that I can see those cases
where FFI looks very inviting. :-)
> I really wish fixnums were the natural machine word.
I agree with you, and I really think this is a problem the compiler
can solve. My current attempt at a solution[*] is basically being
able to pass around unboxed numbers. Take code like the following:
(defun foo (x y)
  (declare (optimize (speed 3)
                     (size 0)
                     (ext:redefinition 0)))
  (+ (expt x 2) y))
The compiler will create two entry points: one takes normal Lisp
values, the other takes (signed-byte 64)'s [on SPARC]. Code that is
compiled knowing this definition will pass in unboxed machine words
when the arguments fit, and computation will proceed essentially as if
it were using fixnums. If they overflow, they'll get promoted to
bignums (and switch code paths). The two disadvantages are code bloat
and making redefinition (and later use of the redefined function)
potentially more expensive.
[*] It's not currently working, but I think it's perfectly workable,
it just needs more work.
--
/|_ .-----------------------.
,' .\ / | No to Imperialist war |
,--' _,' | Wage class war! |
/ / `-----------------------'
( -. |
| ) |
(`-. '--.)
`. )----'