-- Aravindh
Although you could look up the instruction timings in the CPU
manufacturer's documentation, things like cache access are going to be
very variable. They're going to depend on a bunch of factors that are
different on every PC.
If you want to do it yourself, the Pentium has a time-stamp counter
which you can read to get the current cycle count, and which can be used
to time sections of code. The code sample below shows a C/C++ function
that returns that value.
You then need to set up your code so that you can do the timings you
want by forcing cache misses, which will be fairly involved. In
addition, the operating system will be busy doing things in the
background and will throw your timings off.
// Warning :: Assumes you have one!
// (MSVC inline assembly.  The two __emit bytes, 0F 31, are the
// RDTSC opcode, which older assemblers don't know by name.)
__int64 GetPentiumTimeStampCounter()
{
    __int64 counter;
    // read the Pentium Time-Stamp Counter (RDTSC) into EDX:EAX
    __asm __emit 0x0F
    __asm __emit 0x31
    __asm mov dword ptr [counter],   eax   // low 32 bits
    __asm mov dword ptr [counter+4], edx   // high 32 bits
    return counter;
}
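For instance, here's a minimal sketch of how you might use it (my own
illustration; the loop body and iteration count are arbitrary
placeholders): calibrate the overhead of the call itself, then bracket
the code you care about.

#include <stdio.h>

__int64 GetPentiumTimeStampCounter();   // the function above

int main(void)
{
    __int64 start, stop, overhead;
    volatile int sink = 0;   // keeps the compiler from deleting the loop
    int i;

    // Calibrate: two back-to-back reads measure the harness itself.
    start    = GetPentiumTimeStampCounter();
    stop     = GetPentiumTimeStampCounter();
    overhead = stop - start;

    // Time the section of interest.
    start = GetPentiumTimeStampCounter();
    for (i = 0; i < 1000; i++)
        sink += i * i;
    stop = GetPentiumTimeStampCounter();

    printf("%I64d cycles (after subtracting %I64d of overhead)\n",
           (stop - start) - overhead, overhead);
    return 0;
}

Run the measurement many times and keep the minimum; that filters out
most of the background interference from the OS mentioned above.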
You might find what you want here :)
http://book.itzero.com/read/microsoft/0507/Microsoft.Press.Microsoft.Windows.Internals.Fourth.Edition.Dec.2004.internal.eBook-DDU_html/
-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
-- Aravindh
PS: I'm not sure if netiquette allows me to thank one person by posting
to Usenet rather than sending a private email. But I'll do it this way
until someone complains.
>    In Peter Norvig's essay, "Teach Yourself Programming in Ten Years",
>he says it is a good idea for anybody interested in CS to find out
>certain details like -- " Know how long it takes your computer to
>execute an instruction, fetch a word from memory (with and without a
>cache miss), read consecutive words from disk, and seek to a new
>location on disk." He gives a table(values from 2001) for his PC.
>    Can anyone tell me how to determine this accurately for a PC?
In modern hierarchical memory systems, access times are location and
order dependent.  Factors affecting access time include such things as
address translation, cache associativity and line size, bus contention
from DMA devices, even RAM refresh cycles.  The worst case access time
is predictable for a given hardware configuration, but the analysis is
worthless WRT any other configuration.
WRT processors: modern CPUs, with their deep pipelines, multiple
functional units and speculative execution, make instruction timing
somewhere between very hard and impossible.  Intel, for example, no
longer even publishes instruction cycle times for IA-32/64 processors
because there are so many variables.
In some cases an instruction can be seen to effectively execute in
_zero_ cycles (as measured externally) because its execution was
entirely subsumed by the parallel execution of a longer running
instruction.  A short sequence of instructions that beats on a
particular functional unit can be slower than a much longer sequence
that spreads over multiple units.  Register pressure can cause bubbles
in the pipeline.  A cache load miss can stall a pipeline for dozens or
hundreds of cycles ... an exception that flushes the pipeline can take
thousands of cycles.
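To see the overlap effect concretely, here's a small C sketch (my own
illustration; the constants are arbitrary) that executes the same
number of multiplies twice: once as a single dependent chain, and once
spread across four independent chains that the CPU can overlap.  On a
typical out-of-order machine the second loop finishes several times
faster, even though it does the same amount of "work".

#include <stdio.h>
#include <time.h>

#define N 400000000L

int main(void)
{
    volatile long long sink;
    long long a, b, c, d;
    clock_t t0, t1;
    long i;

    /* One dependent chain: every multiply waits for the previous
       one, so we pay the full multiply latency each iteration. */
    a = 1;
    t0 = clock();
    for (i = 0; i < N; i++)
        a = a * 3 + 1;
    t1 = clock();
    sink = a;
    printf("dependent chain:    %.2f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Four independent chains, same total number of multiplies:
       the out-of-order core spreads them over its functional units. */
    a = b = c = d = 1;
    t0 = clock();
    for (i = 0; i < N; i += 4) {
        a = a * 3 + 1;
        b = b * 3 + 1;
        c = c * 3 + 1;
        d = d * 3 + 1;
    }
    t1 = clock();
    sink = a + b + c + d;
    printf("independent chains: %.2f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC);

    (void)sink;
    return 0;
}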
CPU cycle counters, if you have them, are the best source of very
precise timing information, but keep in mind that they are valid only
for *uninterrupted* instruction sequences.  No processor I am aware of
preserves cycle counters across context switches so any multitasking
[even interrupts] may render them useless for a particular purpose.
Also, there is no guarantee that all the preceding instructions will
have finished when the cycle counter is read - Intel processors, for
example, do _not_ wait for other instructions to finish.
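The standard workaround for that last problem is to put a serializing
instruction such as CPUID immediately before RDTSC, so that all
preceding instructions must retire before the counter is sampled
(later CPUs also have RDTSCP, which partially serializes by itself).
A sketch in GCC-style inline asm for x86-64, unlike the MSVC example
earlier in the thread:

#include <stdint.h>

/* Read the TSC behind a serializing CPUID (GCC/Clang, x86-64). */
static uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"   /* select CPUID leaf 0 */
        "cpuid\n\t"               /* serializing: drain the pipeline */
        "rdtsc"                   /* EDX:EAX = time-stamp counter */
        : "=a"(lo), "=d"(hi)
        : /* no inputs */
        : "%rbx", "%rcx");        /* CPUID clobbers EBX and ECX */
    return ((uint64_t)hi << 32) | lo;
}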
Norvig's suggestion is a good one, but in general it's impossible to
achieve outside of a very carefully controlled execution environment
where the performance details for every component are either known or
can be explored.
George
--
for email reply remove "/" from address
>     In Peter Norvig's essay, "Teach Yourself Programming in Ten Years",
> he says it is a good idea for anybody interested in CS to find out
> certain details like -- " Know how long it takes your computer to
> execute an instruction, fetch a word from memory (with and without a
> cache miss), read consecutive words from disk, and seek to a new
> location on disk." He gives a table(values from 2001) for his PC.
>     Can anyone tell me how to determine this accurately for a PC?
Jon Bentley explains how to do something similar in his books
"Programming Pearls" (see appendix 3 "Cost Models for Time and Space"
in the second edition) and "More Programming Pearls" (section 7.2,
"Performance Rules of Thumb").
Paolo
-- 
Why Lisp? http://wiki.alu.org/RtL%20Highlight%20Film
Recommended Common Lisp libraries/tools:
- ASDF/ASDF-INSTALL: system building/installation
- CL-PPCRE: regular expressions
- CFFI: Foreign Function Interface
One popular set of microbenchmarks for this sort of thing is "LMbench":
    http://www.bitmover.com/lmbench/
    http://www.bitmover.com/lmbench/whatis_lmbench.html
    http://www.bitmover.com/lmbench/why_lmbench.html
As the latter page notes, LMbench includes:
    * Memory latency results
      The memory latency test shows the latency of all of the system
      (data) caches, i.e., level 1, 2, and 3, if present, as well as
      main memory and TLB miss latency. In addition the sizes of the
      caches can be read off of a properly plotted set of results.
      The hardware folks like this. This benchmark has found bugs in
      operating system page coloring schemes. 
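If you just want the flavor of what the memory latency test does
without downloading LMbench, here is a small pointer-chasing sketch of
the same idea (my own code, not LMbench's; the sizes, iteration count,
and use of clock() are arbitrary choices).  Each load depends on the
previous one, so the loop time approximates raw load latency, and the
jumps in the output mark the cache boundaries:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void * volatile sink;    /* defeats dead-code elimination */

/* Chase a randomized cycle of pointers through a buffer of the
   given size; return average nanoseconds per dependent load. */
static double chase(size_t bytes, long iters)
{
    size_t n = bytes / sizeof(void *), i;
    void **ring = malloc(n * sizeof(void *));
    size_t *order = malloc(n * sizeof(size_t));
    void **p;
    clock_t t0, t1;
    long k;

    /* Shuffle the visit order (Fisher-Yates) and link it into one
       big cycle, so the prefetcher can't guess the next address. */
    for (i = 0; i < n; i++)
        order[i] = i;
    for (i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (i = 0; i < n; i++)
        ring[order[i]] = &ring[order[(i + 1) % n]];

    p = ring;
    t0 = clock();
    for (k = 0; k < iters; k++)
        p = (void **)*p;        /* serialized, dependent loads */
    t1 = clock();
    sink = p;

    free(order);
    free(ring);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / (double)iters * 1e9;
}

int main(void)
{
    size_t kb;
    for (kb = 4; kb <= 64 * 1024; kb *= 2)
        printf("%6lu KB: %5.1f ns/load\n", (unsigned long)kb,
               chase(kb * 1024, 20000000L));
    return 0;
}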
-Rob
p.s. Disclaimer: I know the original author. He taught me a lot
about playing 9-ball...  ;-}  ;-}
-----
Rob Warnock			<rp...@rpw3.org>
627 26th Avenue			<URL:http://rpw3.org/>
San Mateo, CA 94403		(650)572-2607
Although many people have explained why this is now mostly impossible 
with modern CPU and memory architectures, even simpleminded versions of 
these experiments will give you some order-of-magnitude approximations 
that will be useful for back-of-the-envelope thinking. If you have an 
idea of which operations take nanoseconds versus microseconds versus 
milliseconds you'll be in much better shape than someone who doesn't. 
(And of course there are always the cases where the simple analysis is 
wrong, but that's what testing is for.)
paul
I will give a slightly different answer than most of the others.  After 
you have acquainted yourself with some literature about the
microarchitecture of modern processors (read the Intel manuals, for
example), you will get a decent picture of the timings of different
instructions.
However, in actual code, non-predicted branches and cache-misses are 
often what consumes most of the time, so you should use a statistical 
profiler in order to learn the relative speed of the different 
instructions for a given function.  A statistical profiler will use
built-in hardware or software interrupts to sample where the instruction
pointer is at a semi-constant rate.  By running a piece of code lots of
times, the samples will give a picture of where time is mostly spent.
For SBCL I would recommend the sb-sprof package.  It is beautifully 
integrated with the SBCL disassembler and will give you the number of 
samples that were collected for each instruction.
Example:
decibel:~:(9)$ sbcl
This is SBCL 0.9.6, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses.  See the CREDITS and COPYING files in the
distribution for more information.
* (defun foo (a b c) (declare ((integer 0 50) a b c)) (+ a (* b c)))
FOO
* (require :sb-sprof)
("SB-SPROF")
* (progn
(sb-sprof:start-profiling)
(dotimes (i 300000000) (foo 2 3 4))
(sb-sprof:stop-profiling))
* (disassemble 'foo)
; 03A9628B:       488BC7           MOV RAX, RDI               ; no-arg-parsing entry point
;       8E:       48C1F803         SAR RAX, 3
;       92:       480FAFC6         IMUL RAX, RSI              ; 24/369 samples
;       96:       488D1403         LEA RDX, [RBX+RAX]         ; 17/369 samples
;       9A:       488B4DF0         MOV RCX, [RBP-16]
;       9E:       488B45F8         MOV RAX, [RBP-8]
;       A2:       4883C103         ADD RCX, 3                 ; 18/369 samples
;       A6:       488BE5           MOV RSP, RBP
;       A9:       488BE8           MOV RBP, RAX
;       AC:       FFE1             JMP RCX                    ; 9/369 samples
;       AE:       90               NOP
;       AF:       90               NOP
;       B0:       CC0A             BREAK 10                   ; error trap
;       B2:       02               BYTE #X02
;       B3:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       B4:       CE               BYTE #XCE                  ; RBX
;       B5:       CC0A             BREAK 10                   ; error trap
;       B7:       02               BYTE #X02
;       B8:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       B9:       4E               BYTE #X4E                  ; RCX
;
NIL
Here you see that for a lot of the instructions, the sampler never saw 
the instruction pointer at their address.  This does not mean that they 
take zero time, but they are probably executed in the same clock cycle 
as the next instruction.  The CPU will retire several instructions every 
clock cycle.  Be careful about trusting the samples for the JMP - the 
time is often accounted for at the target location (which is not shown 
here).
Now if we change FOO to do some calculations and put the answer in a 
CONS-cell:
* (defun foo (a b c) (declare ((integer 0 50) a b c)) (cons nil (+ a (* b c))))
STYLE-WARNING: redefining FOO in DEFUN
FOO
* (progn
(sb-sprof:start-profiling)
(dotimes (i 300000000) (foo 2 3 4))
(sb-sprof:stop-profiling))
* (disassemble 'foo)
; 02BB1B3B:       488BC7           MOV RAX, RDI               ; no-arg-parsing entry point
;       3E:       48C1F803         SAR RAX, 3
;       42:       480FAFC6         IMUL RAX, RSI              ; 17/1949 samples
;       46:       488D1403         LEA RDX, [RBX+RAX]         ; 15/1949 samples
;       4A:       41C684249800000000 MOV BYTE PTR [R12+152], 0
                                           (CLEAR INTERRUPT FLAG?)
;       53:       41C684249000000008 MOV BYTE PTR [R12+144], 8
                                           (INDICATE UNINTERRUPTIBLE)
;       5C:       48C7C110000000   MOV RCX, 16
;       63:       49034C2440       ADD RCX, [R12+64]
;       68:       49394C2448       CMP [R12+72], RCX
;       6D:       765B             JBE L2
                                           (CALCULATE NEW END-OF-HEAP
                                            AND SEE IF WE NEED TO
                                            ALLOCATE MORE MEMORY)
;       6F:       49874C2440       XCHG RCX, [R12+64]
                                           (GET ALLOCATED MEMORY/
                                            UPDATE END-OF-HEAP)
;       74: L0:   488D4907         LEA RCX, [RCX+7]
;       78:       41C684249000000000 MOV BYTE PTR [R12+144], 0
;       81:       4180BC249800000000 CMP BYTE PTR [R12+152], 0
;       8A:       7402             JEQ L1
;       8C:       CC09             BREAK 9                    ; pending interrupt trap
                                            (INDICATE INTERRUPTIBLE/
                                             HANDLE INTERRUPTS)
;       8E: L1:   48B81700004000000000 MOV RAX, 1073741847    ; 509/1949 samples
;       98:       488941F9         MOV [RCX-7], RAX           ; 510/1949 samples
;       9C:       48895101         MOV [RCX+1], RDX           ; 105/1949 samples
;       A0:       488BD1           MOV RDX, RCX               ; 8/1949 samples
;       A3:       488B4DF0         MOV RCX, [RBP-16]          ; 13/1949 samples
;       A7:       488B45F8         MOV RAX, [RBP-8]
;       AB:       4883C103         ADD RCX, 3
;       AF:       488BE5           MOV RSP, RBP               ; 23/1949 samples
;       B2:       488BE8           MOV RBP, RAX
;       B5:       FFE1             JMP RCX
;       B7:       90               NOP
;       B8:       90               NOP
;       B9:       90               NOP
;       BA:       90               NOP
;       BB:       90               NOP
;       BC:       90               NOP
;       BD:       90               NOP
;       BE:       90               NOP
;       BF:       90               NOP
;       C0:       CC0A             BREAK 10                   ; error trap
;       C2:       02               BYTE #X02
;       C3:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       C4:       CE               BYTE #XCE                  ; RBX
;       C5:       CC0A             BREAK 10                   ; error trap
;       C7:       02               BYTE #X02
;       C8:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR
;       C9:       4E               BYTE #X4E                  ; RCX
;       CA: L2:   6A10             PUSH 16
;       CC:       4C8D2C2560AA4100 LEA R13, [#x41AA60]        ; alloc_tramp
;       D4:       41FFD5           CALL R13
;       D7:       59               POP RCX
;       D8:       EB9A             JMP L0
;
NIL
Here we see that there is a whole set of instructions that have no 
samples.  These are not sampled because interrupts are turned off (I 
think?) and are accounted for by the large sample count for the MOV at 
L1.  The sample count for that MOV instruction should therefore be
taken as the total over the whole allocation region.
However, the large sample counts for the MOVs at addresses 98 and 9C
are very real.  They represent cache misses for newly allocated CONS
cells!  Usually the first MOV will take the hit, but sometimes a CONS
cell might cross a cache line, in which case both the first and the
second MOV will stall.
Actually, since each cache-line is 64 bytes on the Athlon 64/Opteron and 
a CONS cell is 16 bytes, we would expect a CONS-cell to cross a 
cache-line boundary one out of 4 times.  The 5:1 ratio between the 98/9C 
MOVs supports the theory that most time is spent waiting for memory.
So what do you learn from this?
You learn that a multiplication instruction is about 60 times cheaper 
than allocating and populating a CONS cell in SBCL (17 vs 509+510+105 
samples and that is not counting GC time spent reclaiming the garbage).
You learn that the amortized time to do lots of MOVs, CMPs, etc. to
allocate a CONS cell is less than the time spent waiting to access the
memory that was allocated (509 vs 510+105 samples).  This is because all
the memory accessed during allocation is cached, while the allocated 
memory is usually not cached.
You learn that the multiplication is slightly slower than addition, but 
that it probably does not matter and that these operations will be 
swamped by other stuff unless you concentrate on the inner loop (24 vs 
17 samples).
This is all handy to know.
astor
SB-SPROF:WITH-PROFILING might be more appropriate than explicit 
START/STOP-PROFILING calls in this case.
> ;       6F:       49874C2440       XCHG RCX, [R12+64]
>                                            (GET ALLOCATED MEMORY/
>                                             UPDATE END-OF-HEAP)
"Oops". That XCHG is a remnant of the early days of the x86-64 port,
when no temporary register was usable here. A register/memory XCHG has
a very high latency, so consing could be sped up significantly by
rearranging the code a bit. Based on a quick hack, it looks like your
example program would get about 30% faster.
The moral of the story: even if a statistical profiler is available, 
it sometimes pays to read the instruction timings from the
processor's documentation :-)
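For the curious, you can see that latency gap directly: a
register/memory XCHG carries an implicit LOCK, so it costs far more
than the equivalent MOVs.  A rough sketch (my own, GCC/Clang inline
asm on x86-64, nothing to do with SBCL's actual code):

#include <stdio.h>
#include <time.h>

#define N 100000000L

int main(void)
{
    static long mem = 0;
    long reg = 1, tmp, i;
    clock_t t0, t1;

    /* Implicitly locked: xchg with a memory operand is atomic. */
    t0 = clock();
    for (i = 0; i < N; i++)
        __asm__ __volatile__("xchgq %0, %1" : "+r"(reg), "+m"(mem));
    t1 = clock();
    printf("xchg reg,mem:   %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Same swap via three plain MOVs through a scratch register. */
    t0 = clock();
    for (i = 0; i < N; i++)
        __asm__ __volatile__("movq %1, %2\n\t"
                             "movq %0, %1\n\t"
                             "movq %2, %0"
                             : "+r"(reg), "+m"(mem), "=&r"(tmp));
    t1 = clock();
    printf("mov-based swap: %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}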
> Actually, since each cache-line is 64 bytes on the Athlon 64/Opteron and 
> a CONS cell is 16 bytes, we would expect a CONS-cell to cross a 
> cache-line boundary one out of 4 times.  The 5:1 ratio between the 98/9C 
> MOVs supports the theory that most time is spent waiting for memory.
Good theory, but consed memory is always 16-byte aligned on 64-bit 
SBCL.
-- 
Juho Snellman
-- Aravindh
It's important to understand that the memory latency is huge compared to the
processor speed. For instance here are the values for my notebook (2.0GHz
Pentium M with benchmark software from http://www.sciencemark.org/):
L1  Latency: 3 cycles / 1.50 ns / 32 byte stride
L2  Latency: 3 cycles / 1.50 ns / 4 byte stride
L2  Latency: 4 cycles / 2.01 ns / 16 byte stride
L2  Latency: 10 cycles / 5.02 ns / 64 byte stride
L2  Latency: 10 cycles / 5.02 ns / 256 byte stride
L2  Latency: 10 cycles / 5.02 ns / 512 byte stride
Mem Latency: 4 cycles / 2.01 ns / 4 byte stride
Mem Latency: 16 cycles / 7.52 ns / 16 byte stride
Mem Latency: 59 cycles / 30.10 ns / 64 byte stride
Mem Latency: 197 cycles / 96.31 ns / 256 byte stride
Mem Latency: 195 cycles / 100.83 ns / 512 byte stride
L1  BW: 24836.98 MB/s
L2  BW: 10156.85 MB/s
Mem BW: 2075.21 MB/s
As you see, it can easily take almost 200 processor cycles to fetch
data from memory.
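A crude sketch of the kind of loop behind numbers like these (my own
code, not ScienceMark's; buffer size and strides are arbitrary): walk
a buffer much larger than the caches at increasing strides.  Small
strides amortize each cache-line miss over many accesses and let the
prefetcher help; once the stride reaches the line size, every access
is a fresh miss.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE (64 * 1024 * 1024)   /* 64 MB: well beyond any cache */

int main(void)
{
    char *buf = malloc(SIZE);
    volatile long sum = 0;
    size_t stride, i;

    memset(buf, 1, SIZE);          /* commit the pages up front */

    for (stride = 4; stride <= 512; stride *= 2) {
        size_t accesses = SIZE / stride;
        clock_t t0 = clock(), t1;
        for (i = 0; i < SIZE; i += stride)
            sum += buf[i];
        t1 = clock();
        printf("%4lu byte stride: %6.2f ns/access\n",
               (unsigned long)stride,
               (double)(t1 - t0) / CLOCKS_PER_SEC / (double)accesses * 1e9);
    }
    (void)sum;
    free(buf);
    return 0;
}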
You can read the software optimization manuals from Intel, IBM (PPC)
and AMD for more info on how it works. For instance the AMD one is here:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
You can also read up on how DDR/DDR2 SDRAM works on the RAM
manufacturers' sites, as it's rather complex now.
Marc
In addition to empirical timings, I highly recommend reading at least
one architecture spec and one processor implementation manual.  I can
recommend either SPARC and Sun's implementation, or PowerPC and
Motorola's implementation.
-- 
           /|_     .-----------------------.                        
         ,'  .\  / | Free Mumia Abu-Jamal! |
     ,--'    _,'   | Abolish the racist    |
    /       /      | death penalty!        |
   (   -.  |       `-----------------------'
   |     ) |                               
  (`-.  '--.)                              
   `. )----'