Good point. (Daniel's suggestion would delay the measurements
until after Java HotSpot(TM) VM's mixed mode has decided it is
worthwhile to use the JIT-compiler.)
Adding to my earlier explanation of why short lists are often
faster than short vectors, and illustrating some points made
by Jeffrey, here is the source code I benchmarked:
(define (search-list x y)
(cond ((null? y)
#f)
((eq? x (car y))
#t)
(else
(search-list x (cdr y)))))
(define (search-vector x y)
(let ((n (vector-length y)))
(let loop ((i 0))
(cond ((= i n)
#f)
((eq? x (vector-ref y i))
#t)
(else
(loop (+ i 1)))))))
In Larceny, the loop within search-list turns into the following
x86 machine code, to which I have added comments showing slightly
higher-level virtual machine code from which the x86 code was then
generated. (The check pseudo-instruction branches to an exception
sequence if the preceding safety check comes out false.)
00000018 80FA0A cmp dl,0xa ; reg 2 ; y
; op2 null?
0000001B 7506 jnz 0x23 ; branchf 0x23
0000001D BB02000000 mov ebx,0x2 ; const #f
00000022 C3 ret ; return
00000023 8D4207 lea eax,[edx+0x7] ; reg 2 ; y
00000026 A807 test al,0x7 ; op1 pair?
00000028 751C jnz 0x46 ; check 0x46,(2 0 0)
0000002A 8B7AFF mov edi,[edx-0x1] ; reg 2 ; y
; op1 car:pair
0000002D 39F9 cmp ecx,edi ; op2 eq?,1
0000002F 7506 jnz 0x37 ; branchf 0x37
00000031 BB06000000 mov ebx,0x6 ; const #t
00000036 C3 ret ; return
00000037 8B5203 mov edx,[edx+0x3] ; reg 2 ; y
; op2 cdr:pair
; setreg 2
0000003A FF4D08 dec dword [ebp+0x8] ; branch 0x18
0000003D 75D9 jnz 0x18
0000003F FF5518 call dword [ebp+0x18] ; (timer exception)
In the common case, that inner loop executes 11 machine instructions.
The inner loop for search-vector normally executes 20 instructions:
00000018 8B5D48 mov ebx,[ebp+0x48] ; reg 4 ; i
0000001B 39FB cmp ebx,edi ; op2 =:fix:fix,3
0000001D 7506 jnz 0x25 ; branchf 0x25
0000001F BB02000000 mov ebx,0x2 ; const #f
00000024 C3 ret ; return
00000025 8B5AFD mov ebx,[edx-0x3] ; reg 2 ; y
00000028 C1EB08 shr ebx,byte 0x8 ; op1 vector-length:vec
0000002B 895D7C mov [ebp+0x7c],ebx ; setreg 17
0000002E 8B5D48 mov ebx,[ebp+0x48] ; reg 4 ; i
00000031 3B5D7C cmp ebx,[ebp+0x7c] ; op2 <:fix:fix,17
00000034 7D5B jnl 0x91 ; check 0x91,(4 2 0)
00000036 8B4548 mov eax,[ebp+0x48] ; reg 4 ; i
; setreg eax ; TEMP
00000039 8B5C0201 mov ebx,[edx+eax+0x1] ; reg 2 ; y
; op2 vector-ref:trusted,eax
0000003D 895D7C mov [ebp+0x7c],ebx ; setreg 17
00000040 3B4D7C cmp ecx,[ebp+0x7c] ; reg 1 ; x
; op2 eq?,17
00000043 7506 jnz 0x4b ; branchf 0x4b
00000045 BB06000000 mov ebx,0x6 ; const #t
0000004A C3 ret ; return
0000004B 8B5D48 mov ebx,[ebp+0x48] ; reg 4 ; i
0000004E 8D5B04 lea ebx,[ebx+0x4] ; op2imm +:imm:idx,1
00000051 895D7C mov [ebp+0x7c],ebx ; setreg 17
00000054 895D48 mov [ebp+0x48],ebx ; setreg 4
00000057 FF4D08 dec dword [ebp+0x8] ; branch 0x18
0000005A 75BC jnz 0x18
0000005C FF5518 call dword [ebp+0x18] ; (timer exception)
Even though (vector-length y) is computed outside of the loop,
and remains available in register 3 (x86 register edi), Larceny
is recomputing that length every time through the loop, at the
cost of three machine instructions. Other inefficiencies can be
seen in both inner loops, including the cost of using only the
IA32 architecture's limited register set. If Larceny's compiler
were using the larger x64 register set, both loops should run
faster.
In my opinion, the main reason to use an optimizing compiler
is that it lets you focus on your programming problem instead
of obsessing about the micro-efficiency of your code. As the
machine code above reveals, there isn't much a programmer could
do to make these inner loops run faster. To make them faster,
you'd have to use a better compiler or sacrifice some safety.
Will