-- Aravindh
Although you could look the instruction timings in the CPU
manufacturers documentation, stuff like cache access is going to be
very variable. It's going to depend on a bunch of factors that are
different on every PC.
If you want to do it yourself, the Pentium has a timer which you can
use getting the current cycle count, which can be used to time sections
of code. A code sample below shows a C/C++ function that would return
that value.
You then need to set up your code so that you can do timings you want
by forcing cache misses, which will be fairly involved. In addition the
operating system will be busy doing things in the background and throw
your timings off.
// Warning :: Assumes you have one!
__int64 GetPentiumTimeStampCounter()
{
__int64 RDTSC;
// read the Pentium Time-Stamp Counter (RDTSC)
__asm __emit 0x0F
__asm __emit 0x31
__asm mov dword ptr[RDTSC] ,eax
__asm mov dword ptr[RDTSC+4],edx
return RDTSC;
}
You might find what you wnat here :)
http://book.itzero.com/read/microsoft/0507/Microsoft.Press.Microsoft.Windows.Internals.Fourth.Edition.Dec.2004.internal.eBook-DDU_html/
--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
-- Aravindh
PS: I'm not sure if netiquette allows me to thank one person by posting
to the usenet rather than sending a private email. But I'll follow this
until someone complains of annoyance.
> In Peter Norvig's essay, "Teach Yourself Programming in Ten Years",
>he says it is a good idea for anybody interested in CS to find out
>certain details like -- " Know how long it takes your computer to
>execute an instruction, fetch a word from memory (with and without a
>cache miss), read consecutive words from disk, and seek to a new
>location on disk." He gives a table(values from 2001) for his PC.
> Can anyone tell me how to determine this accurately for a PC?
In modern hierarchical memory systems, access times are location and
order dependent. Factors affecting access time include such things as
address translation, cache associativity and line size, bus contention
from DMA devices, even RAM refresh cycles. The worst case access time
is predictable for a given hardware configuration, but the analysis is
worthless WRT any other configuration.
WRT processors: modern CPUs with their deep pipelines, multiple
function units and speculative execution make instruction timing
somewhere between very hard and impossible. Intel, for example, no
longer even publishes instruction cycle times for IA-32/64 processors
because there are so many variables.
In some cases an instruction can be seen to effectively execute in
_zero_ cycles (as measured externally) because it's execution was
entirely subsumed by parallel execution of a longer running
instruction. A short sequence of instructions that beats on a
particular functional unit can be slower than a much longer sequence
that spreads over multiple units. Register pressure can cause bubbles
in the pipeline. A cache load miss can stall a pipeline for dozens or
hundreds of cycles ... an exception that flushes the pipeline can take
thousands of cycles.
CPU cycle counters, if you have them, are the best source of very
precise timing information, but keep in mind that they are valid only
for *uninterrupted* instruction sequences. No processor I am aware of
preserves cycle counters across context switches so any multitasking
[even interrupts] may render them useless for a particular purpose.
Also, there is no guarantee that all the preceding instructions will
have finished when the cycle counter is read - Intel processors, for
example, do _not_ wait for other instructions to finish.
Norvig's suggestion is a good one, but in general it's impossible to
achieve outside of a very carefully controlled execution environment
where the performance details for every component are either known or
can be explored.
George
--
for email reply remove "/" from address
> In Peter Norvig's essay, "Teach Yourself Programming in Ten Years",
> he says it is a good idea for anybody interested in CS to find out
> certain details like -- " Know how long it takes your computer to
> execute an instruction, fetch a word from memory (with and without a
> cache miss), read consecutive words from disk, and seek to a new
> location on disk." He gives a table(values from 2001) for his PC.
> Can anyone tell me how to determine this accurately for a PC?
Jon Bentley explains how to do something similar in his books
"Programming Pearls" (see appendix 3 "Cost Models for Time and Space"
in the second edition) and "More Programming Pearls" (section 7.2
"Performance Rules of Thumb).
Paolo
--
Why Lisp? http://wiki.alu.org/RtL%20Highlight%20Film
Recommended Common Lisp libraries/tools:
- ASDF/ASDF-INSTALL: system building/installation
- CL-PPCRE: regular expressions
- CFFI: Foreign Function Interface
One popular set of microbenchmarks for this sort of thing is "LMbench":
http://www.bitmover.com/lmbench/
http://www.bitmover.com/lmbench/whatis_lmbench.html
http://www.bitmover.com/lmbench/why_lmbench.html
As the latter page notes, LMbench includes:
* Memory latency results
The memory latency test shows the latency of all of the system
(data) caches, i.e., level 1, 2, and 3, if present, as well as
main memory and TLB miss latency. In addition the sizes of the
caches can be read off of a properly plotted set of results.
The hardware folks like this. This benchmark has found bugs in
operating system page coloring schemes.
-Rob
p.s. Disclaimer: I know the original author. He taught me a lot
about playing 9-ball... ;-} ;-}
-----
Rob Warnock <rp...@rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
Although many people have explained why this is now mostly impossible
with modern CPU and memory architectures, even simpleminded versions of
these experiments will give you some order-of-magnitude approximations
that will be useful for back-of-the-envelope thinking. If you have an
idea of which operations take nanoseconds versus microseconds versus
milliseconds you'll be in much better shape than someone who doesn't.
(And of course there are always the cases where the simple analysis is
wrong, but that's what testing is for.)
paul
I will give a slightly different answer than most of the others. After
you have acquainted yourself with some literature about the micro
architecture of modern processors (read the Intel manuals for example),
you will get a decent picture about timings of different instructions.
However, in actual code, non-predicted branches and cache-misses are
often what consumes most of the time, so you should use a statistical
profiler in order to learn the relative speed of the different
instructions for a given function. A statistical profiler will use
built-in hardware or software interrupts to sample where the instruction
pointer is at a semi-constant rate. By running a piece of code lots
of times, the samples will give a picture of where time is mostly spent.
For SBCL I would recommend the sb-sprof package. It is beautifully
integrated with the SBCL disassembler and will give you the number of
samples that were collected for each instruction.
Example:
decibel:~:(9)$ sbcl
This is SBCL 0.9.6, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (defun foo (a b c) (declare ((integer 0 50) a b c)) (+ a (* b c)))
FOO
* (require :sb-sprof)
("SB-SPROF")
* (progn
(sb-sprof:start-profiling)
(dotimes (i 300000000) (foo 2 3 4))
(sb-sprof:stop-profiling))
* (disassemble 'foo)
; 03A9628B: 488BC7 MOV RAX, RDI ;
no-arg-parsing entry point
; 8E: 48C1F803 SAR RAX, 3
; 92: 480FAFC6 IMUL RAX, RSI ; 24/369
samples
; 96: 488D1403 LEA RDX, [RBX+RAX] ; 17/369
samples
; 9A: 488B4DF0 MOV RCX, [RBP-16]
; 9E: 488B45F8 MOV RAX, [RBP-8]
; A2: 4883C103 ADD RCX, 3 ; 18/369
samples
; A6: 488BE5 MOV RSP, RBP
; A9: 488BE8 MOV RBP, RAX
; AC: FFE1 JMP RCX ; 9/369
samples
; AE: 90 NOP
; AF: 90 NOP
; B0: CC0A BREAK 10 ; error trap
; B2: 02 BYTE #X02
; B3: 18 BYTE #X18 ;
INVALID-ARG-COUNT-ERROR
; B4: CE BYTE #XCE ; RBX
; B5: CC0A BREAK 10 ; error trap
; B7: 02 BYTE #X02
; B8: 18 BYTE #X18 ;
INVALID-ARG-COUNT-ERROR
; B9: 4E BYTE #X4E ; RCX
;
NIL
Here you see that for a lot of the instructions, the sampler never saw
the instruction pointer at their address. This does not mean that they
take zero time, but they are probably executed in the same clock cycle
as the next instruction. The CPU will retire several instructions every
clock cycle. Be careful about trusting the samples for the JMP - the
time is often accounted for at the target location (which is not shown
here).
Now if we change FOO to do some calculations and put the answer in a
CONS-cell:
* (defun foo (a b c) (declare ((integer 0 50) a b c)) (cons nil (+ a (*
b c))))
STYLE-WARNING: redefining FOO in DEFUN
FOO
* (progn
(sb-sprof:start-profiling)
(dotimes (i 300000000) (foo 2 3 4))
(sb-sprof:stop-profiling))
* (disassemble 'foo)
; 02BB1B3B: 488BC7 MOV RAX, RDI ;
no-arg-parsing entry point
; 3E: 48C1F803 SAR RAX, 3
; 42: 480FAFC6 IMUL RAX, RSI ; 17/1949
samples
; 46: 488D1403 LEA RDX, [RBX+RAX] ; 15/1949
samples
; 4A: 41C684249800000000 MOV BYTE PTR [R12+152], 0
(CLEAR INTERRUPT FLAG?)
; 53: 41C684249000000008 MOV BYTE PTR [R12+144], 8
(INDICATE UNINTERRUPTIBLE)
; 5C: 48C7C110000000 MOV RCX, 16
; 63: 49034C2440 ADD RCX, [R12+64]
; 68: 49394C2448 CMP [R12+72], RCX
; 6D: 765B JBE L2
(CALCULATE NEW END-OF-HEAP
AND SEE IF WE NEED TO
ALLOCATE MOVE MEMORY)
; 6F: 49874C2440 XCHG RCX, [R12+64]
(GET ALLOCATED MEMORY/
UPDATE END-OF-HEAP)
; 74: L0: 488D4907 LEA RCX, [RCX+7]
; 78: 41C684249000000000 MOV BYTE PTR [R12+144], 0
; 81: 4180BC249800000000 CMP BYTE PTR [R12+152], 0
; 8A: 7402 JEQ L1
; 8C: CC09 BREAK 9 ; pending
interrupt trap
(INDICATE INTERRUPTIBLE/
HANDLE INTERRUPTS)
; 8E: L1: 48B81700004000000000 MOV RAX, 1073741847 ; 509/1949
samples
; 98: 488941F9 MOV [RCX-7], RAX ; 510/1949
samples
; 9C: 48895101 MOV [RCX+1], RDX ; 105/1949
samples
; A0: 488BD1 MOV RDX, RCX ; 8/1949
samples
; A3: 488B4DF0 MOV RCX, [RBP-16] ; 13/1949
samples
; A7: 488B45F8 MOV RAX, [RBP-8]
; AB: 4883C103 ADD RCX, 3
; AF: 488BE5 MOV RSP, RBP ; 23/1949
samples
; B2: 488BE8 MOV RBP, RAX
; B5: FFE1 JMP RCX
; B7: 90 NOP
; B8: 90 NOP
; B9: 90 NOP
; BA: 90 NOP
; BB: 90 NOP
; BC: 90 NOP
; BD: 90 NOP
; BE: 90 NOP
; BF: 90 NOP
; C0: CC0A BREAK 10 ; error trap
; C2: 02 BYTE #X02
; C3: 18 BYTE #X18 ;
INVALID-ARG-COUNT-ERROR
; C4: CE BYTE #XCE ; RBX
; C5: CC0A BREAK 10 ; error trap
; C7: 02 BYTE #X02
; C8: 18 BYTE #X18 ;
INVALID-ARG-COUNT-ERROR
; C9: 4E BYTE #X4E ; RCX
; CA: L2: 6A10 PUSH 16
; CC: 4C8D2C2560AA4100 LEA R13, [#x41AA60] ; alloc_tramp
; D4: 41FFD5 CALL R13
; D7: 59 POP RCX
; D8: EB9A JMP L0
;
NIL
Here we see that there is a whole set of instructions that have no
samples. These are not sampled because interrupts are turned off (I
think?) and are accounted for by the large sample count for the MOV at
L1. The sample for that MOV instruction should therefore be taken as
the total over the whole allocation region.
However, the large sample counts for the MOVs at address 98 and 9C are
very real. They represent cache-misses for newly allocated CONS cells!
Usually the first MOV will take the hit, but some times a CONS cell
might cross a cache-line and both the first and the second MOV will
stall in this case.
Actually, since each cache-line is 64 bytes on the Athlon 64/Opteron and
a CONS cell is 16 bytes, we would expect a CONS-cell to cross a
cache-line boundary one out of 4 times. The 5:1 ratio between the 98/9C
MOVs supports the theory that most time is spent waiting for memory.
So what do you lean from this?
You learn that a multiplication instruction is about 60 times cheaper
than allocating and populating a CONS cell in SBCL (17 vs 509+510+105
samples and that is not counting GC time spent reclaiming the garbage).
You learn that the amortized time to to lots of MOVs, CMP etc to
allocate a CONS cell is less than the time spent waiting to access the
memory that was allocated (509 vs 510+105 cycles). This is because all
the memory accessed during allocation is cached, while the allocated
memory is usually not cached.
You learn that the multiplication is slightly slower than addition, but
that it probably does not matter and that these operations will be
swamped by other stuff unless you concentrate on the inner loop (24 vs
17 samples).
This is all handy to know.
astor
SB-SPROF:WITH-PROFILING might be more appropriate than explicit
START/STOP-PROFILING calls in this case.
> ; 6F: 49874C2440 XCHG RCX, [R12+64]
> (GET ALLOCATED MEMORY/
> UPDATE END-OF-HEAP)
"Oops". That XCHG is a remnant of the early days of the x86-64 port,
when no temporary register was usable here. A register/memory XCHG has
a very high latency, so consing could be sped up significantly by
rearranging the code a bit. Based on a quick hack, it looks like your
example program would get about 30% faster.
The moral of the story: even if a statistical profiler is available,
it sometimes pays to read the instruction timings from the
processor's documentation :-)
> Actually, since each cache-line is 64 bytes on the Athlon 64/Opteron and
> a CONS cell is 16 bytes, we would expect a CONS-cell to cross a
> cache-line boundary one out of 4 times. The 5:1 ratio between the 98/9C
> MOVs supports the theory that most time is spent waiting for memory.
Good theory, but consed memory is always 16-byte aligned on 64-bit
SBCL.
--
Juho Snellman
-- Aravindh
It's important to understand that the memory latency is huge compared to the
processor speed. For instance here are the values for my notebook (2.0GHz
Pentium M with benchmark software from http://www.sciencemark.org/):
L1 Latency: 3 cycles / 1.50 ns / 32 byte stride
L2 Latency: 3 cycles / 1.50 ns / 4 byte stride
L2 Latency: 4 cycles / 2.01 ns / 16 byte stride
L2 Latency: 10 cycles / 5.02 ns / 64 byte stride
L2 Latency: 10 cycles / 5.02 ns / 256 byte stride
L2 Latency: 10 cycles / 5.02 ns / 512 byte stride
Mem Latency: 4 cycles / 2.01 ns / 4 byte stride
Mem Latency: 16 cycles / 7.52 ns / 16 byte stride
Mem Latency: 59 cycles / 30.10 ns / 64 byte stride
Mem Latency: 197 cycles / 96.31 ns / 256 byte stride
Mem Latency: 195 cycles / 100.83 ns / 512 byte stride
L1 BW: 24836.98 MB/s
L2 BW: 10156.85 MB/s
Mem BW: 2075.21 MB/s
As you see, you can easily have almost 200 processor cycles to fetch data
from memory.
You can read the software optimization manual for the Intel, IBM PPC and AMD
to have more info on how it works. For instance the AMD one is here:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/2511
2.PDF
You can also read how the DDR/DDR2 SDRAM works from the RAM manufacturers'
sites as it's rather complex now.
Marc
In addition to empirical timings, I highly recommend reading at least
one architecture spec, and one processor implementation manual. I can
highly recommend either SPARC and Sun's implementation, or Power PC
and Motorolla's implementation.
--
/|_ .-----------------------.
,' .\ / | Free Mumia Abu-Jamal! |
,--' _,' | Abolish the racist |
/ / | death penalty! |
( -. | `-----------------------'
| ) |
(`-. '--.)
`. )----'