I remember looking into this recently, and I think it's less and less true that QPC is expensive. On all "recent" x86/x64 CPUs, the cycle counter (read with RDTSC) has been fixed so that it works as a reliable, low-cost, low-power, high-resolution, accurate wall-clock timer (see [http://en.wikipedia.org/wiki/Time_Stamp_Counter#Implementation_in_various_processors]). On CPUs where this holds, recent systems (probably Vista or "better") implement QPC as a bit of function preamble followed by an RDTSC instruction, all in user mode.
On Windows XP, QPC is still a system call, but the system call in turn executes RDTSC where available.
Here is the user-mode path on a recent system:
0:000> uf kernel32!queryperformancecounter
kernel32!QueryPerformanceCounter:
76c31732 ff25d40dc376 jmp dword ptr [kernel32!_imp__QueryPerformanceCounter (76c30dd4)]
ntdll!RtlQueryPerformanceCounter:
777d88a4 8bff mov edi,edi
777d88a6 55 push ebp
777d88a7 8bec mov ebp,esp
777d88a9 51 push ecx
777d88aa 51 push ecx
777d88ab f605ed02fe7f01 test byte ptr [SharedUserData+0x2ed (7ffe02ed)],1
777d88b2 0f844bf50400 je ntdll!RtlQueryPerformanceCounter+0x55 (77827e03)
ntdll!RtlQueryPerformanceCounter+0x10:
777d88b8 56 push esi
ntdll!RtlQueryPerformanceCounter+0x11:
777d88b9 8b0db803fe7f mov ecx,dword ptr [SharedUserData+0x3b8 (7ffe03b8)]
777d88bf 8b35bc03fe7f mov esi,dword ptr [SharedUserData+0x3bc (7ffe03bc)]
777d88c5 a1b803fe7f mov eax,dword ptr [SharedUserData+0x3b8 (7ffe03b8)]
777d88ca 8b15bc03fe7f mov edx,dword ptr [SharedUserData+0x3bc (7ffe03bc)]
777d88d0 3bc8 cmp ecx,eax
777d88d2 75e5 jne ntdll!RtlQueryPerformanceCounter+0x11 (777d88b9)
ntdll!RtlQueryPerformanceCounter+0x2c:
777d88d4 3bf2 cmp esi,edx
777d88d6 75e1 jne ntdll!RtlQueryPerformanceCounter+0x11 (777d88b9)
ntdll!RtlQueryPerformanceCounter+0x30:
777d88d8 0f31 rdtsc
[snip]
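For anyone who doesn't read x86 fluently, the loop just before the RDTSC above can be sketched in C roughly as follows. This is only my reading of the listing: the struct and function names are made up for illustration, the struct is a stand-in for the two dwords at SharedUserData+0x3b8 and +0x3bc, and whatever is done with the value after RDTSC is snipped, so it is not shown.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two 32-bit halves that the listing
   reads from SharedUserData+0x3b8 (low) and +0x3bc (high). */
typedef struct {
    volatile uint32_t lo;
    volatile uint32_t hi;
} shared_u64;

/* Read both halves twice and retry until the two passes agree,
   mirroring the cmp/jne pairs that jump back to +0x11. */
static uint64_t read_stable_u64(const shared_u64 *s)
{
    uint32_t lo1, hi1, lo2, hi2;
    do {
        lo1 = s->lo;   /* mov ecx,[+0x3b8] */
        hi1 = s->hi;   /* mov esi,[+0x3bc] */
        lo2 = s->lo;   /* mov eax,[+0x3b8] */
        hi2 = s->hi;   /* mov edx,[+0x3bc] */
    } while (lo1 != lo2 || hi1 != hi2);
    return ((uint64_t)hi2 << 32) | lo2;
}
```

The point is that a 32-bit process cannot load the 64-bit value in one instruction, so the code re-reads it until two consecutive reads match, with no lock and no system call involved.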
On an XP SP2 system, QPC is still a system call:
0:000> uf kernel32!QueryPerformanceCounter
kernel32!QueryPerformanceCounter:
7c80a4c7 8bff mov edi,edi
7c80a4c9 55 push ebp
7c80a4ca 8bec mov ebp,esp
7c80a4cc 51 push ecx
7c80a4cd 51 push ecx
7c80a4ce 8d45f8 lea eax,[ebp-8]
7c80a4d1 50 push eax
7c80a4d2 ff7508 push dword ptr [ebp+8]
7c80a4d5 ff15dc13807c call dword ptr [kernel32!_imp__NtQueryPerformanceCounter (7c8013dc)]
7c80a4db 85c0 test eax,eax
7c80a4dd 0f8c17760300 jl kernel32!QueryPerformanceCounter+0x18 (7c841afa)
[snip]
But the system call subsequently uses RDTSC where available. In the kernel, this funnels into hal!KeQueryPerformanceCounter:
lkd> uf hal!KeQueryPerformanceCounter
hal!HalpAcpiTimerQueryPerfCount:
806e5c78 a0dce06e80 mov al,byte ptr [hal!HalpUse8254 (806ee0dc)]
806e5c7d 0ac0 or al,al
806e5c7f 752d jne hal!HalpAcpiTimerQueryPerfCount+0x36 (806e5cae)
hal!HalpAcpiTimerQueryPerfCount+0x9:
806e5c81 8b4c2404 mov ecx,dword ptr [esp+4]
806e5c85 0bc9 or ecx,ecx
806e5c87 7412 je hal!HalpAcpiTimerQueryPerfCount+0x23 (806e5c9b)
hal!HalpAcpiTimerQueryPerfCount+0x11:
806e5c89 64a1a4000000 mov eax,dword ptr fs:[000000A4h]
806e5c8f 648b15a8000000 mov edx,dword ptr fs:[0A8h]
806e5c96 8901 mov dword ptr [ecx],eax
806e5c98 895104 mov dword ptr [ecx+4],edx
hal!HalpAcpiTimerQueryPerfCount+0x23:
806e5c9b 0f31 rdtsc
806e5c9d 640305ac000000 add eax,dword ptr fs:[0ACh]
806e5ca4 641315b0000000 adc edx,dword ptr fs:[0B0h]
806e5cab c20400 ret 4
[snip fallback]
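In C terms, the RDTSC path of that HAL routine boils down to something like the sketch below. The names are hypothetical, and I'm assuming that the stack argument is the optional frequency out-pointer and that the fs: slots are per-processor values (a frequency at fs:[0A4h]/[0A8h] and an offset at fs:[0ACh]/[0B0h]); the listing itself doesn't label them. To keep the sketch portable, the raw counter is passed in rather than read with RDTSC.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the per-processor slots the listing reads via fs:. */
typedef struct {
    uint64_t frequency;   /* assumed: fs:[0A4h] (low) and fs:[0A8h] (high) */
    uint64_t tsc_offset;  /* assumed: fs:[0ACh] (low) and fs:[0B0h] (high) */
} per_cpu_timer;

/* Sketch of the fast path: optionally report the frequency, then return
   the raw counter plus the per-processor offset. */
static uint64_t query_perf_count(const per_cpu_timer *cpu,
                                 uint64_t raw_tsc,
                                 uint64_t *frequency_out)
{
    if (frequency_out != NULL) {          /* or ecx,ecx / je */
        *frequency_out = cpu->frequency;  /* mov [ecx],eax / mov [ecx+4],edx */
    }
    /* add eax,... / adc edx,... is a 64-bit add done in two 32-bit steps */
    return raw_tsc + cpu->tsc_offset;
}
```

So apart from the transition into the kernel, the work is a null check, an optional 8-byte store, one RDTSC and a 64-bit add.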
Siggi