Yet another handmade POPCNT

hopcode

unread,

Jan 4, 2011, 7:50:03 PM1/4/11

to

Hi all,
while testing SSE PHADDx instructions i came upon the idea
(most like a joke) to use it for a POPCNT instruction.

i would be pleased to discuss with You the following
population count snippet. ok, the snippet first:

movdqa xmm3,dqword[mynum] ; 30303030'22222222h,0ABB0832'11440C20h
movdqa xmm2,dqword[mask_0F] ; 0F0F0F0F'0F0F0F0Fh,0F0F0F0F'0F0F0F0Fh
movdqa xmm1,xmm3 ; 30303030'22222222h,0ABB0832'11440C20h
psrlw xmm1,4 ; 03030303'02220222h,00AB0083'011400C2h
pand xmm3,xmm2 ; 00000000'02020202h,0A0B0802'01040C00h
pand xmm1,xmm2 ; 03030303'02020202h,000B0003'01040002h
movdqa xmm0,dqword[mask_bt] ; 04030302'03020201h,03020201'02010100h
movdqa xmm2,xmm0 ; 04030302'03020201h,03020201'02010100h
pshufb xmm0,xmm1 ; 02020202'01010101h,00030002'01010001h
pshufb xmm2,xmm3 ; 00000000'01010101h,02030101'01010200h
paddb xmm0,xmm2 ; 02020202'02020202h,02060103'02020201h
phaddd xmm0,xmm0 ; 04040404'04080304h,04040404'04080304h
phaddw xmm0,xmm0 ; 0808070C'0808070Ch,0808070C'0808070Ch
movdq2q mm0,xmm0 ; 0808070C'0808070Ch
phaddw mm0,mm0 ; 0F140F14'0F140F14h
movd eax,mm0 ; 00000000'0F140F14h
add al,ah ; 00000000'0F140F23h
movzx rax,al ; 00000000'00000023h

Thats'all. it works to calc max 128 bits. Note that i didnt test it
against speed or size, for some reasons i will reserve them for me
till up after having heard what you think about.

Cheers,

.::mrk::.
x64 Assembly Lab
http://sites.google.com/site/x64lab

Terje Mathisen

unread,

Jan 5, 2011, 2:58:08 AM1/5/11

to

If you don't have POPCNT and need to count a _lot_ of bits (i.e. far
more than a single 128-bit block), then the fastest approach is to run a
bitslice algorithm first, merging blocks together using FULL ADDERS to
do so:

Since each fulladd() takes three inputs (a,b,carryin) and results in two
outputs (sum, carryout), it reduces the number of blocks, right?

With a hierarchical set of (inline) fulladd()'s you get a single block
with a number from 0-15 in each nybble of the result, whereupon you can
merge pairs of nybbles into bytes, bytes to words and words to dwords etc.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

hopcode

unread,

Jan 5, 2011, 5:13:53 AM1/5/11

to

Il 05.01.2011 08:58, Terje Mathisen ha scritto:
> If you don't have POPCNT and need to count a _lot_ of bits (i.e. far
> more than a single 128-bit block), then the fastest approach is to run a
> bitslice algorithm first, merging blocks together using FULL ADDERS to
> do so:

If You mean something like that,
http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html
/Algo/
http://dalkescientific.com/writings/diary/popcnt.c
(browse at the bottom for bitslice implementation)
well, browsing the comments too, one of them reminds the author how
important is *NOT* neglecting the L1 cache capability (read: misses).

>
> Since each fulladd() takes three inputs (a,b,carryin) and results in two
> outputs (sum, carryout), it reduces the number of blocks, right?
>
> With a hierarchical set of (inline) fulladd()'s you get a single block
> with a number from 0-15 in each nybble of the result, whereupon you can
> merge pairs of nybbles into bytes, bytes to words and words to dwords etc.
>
> Terje

It is ok..., a reduction of blocks..., even using the classical method
mask55/33/0F. One should take always in account the cache, imho,
especially by big bulks of datas to be parsed. Also i dont see so super
performances of that bitslicing method over other ones (example:
http://wm.ite.pl/articles/sse-popcount.html )

But the bitslicing method could turn to be useful on new processors with
shared data amongs the cores.

Cheers,

--

Terje Mathisen

unread,

Jan 5, 2011, 8:41:47 AM1/5/11

to

hopcode wrote:
> Il 05.01.2011 08:58, Terje Mathisen ha scritto:
>> If you don't have POPCNT and need to count a _lot_ of bits (i.e. far
>> more than a single 128-bit block), then the fastest approach is to run a
>> bitslice algorithm first, merging blocks together using FULL ADDERS to
>> do so:
> If You mean something like that,
> http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html

Bitslice is useless if you actually have a single-cycle POPCNT opcode,
obviously.

I note in your links that Robert Harley's original 64-bit Alpha code was
used as the reference, except that he used blocks of 15 64-bit words,
i.e. he would split the input into 120-byte buffers.

They key idea behind using 15 blocks was to make sure everything would
fit inside the 32 available registers, i.e. zero spill/fill memory traffic.

On x86 I had to settle with 7-wide blocks in order to make it work with
just 7/8 registers.

> It is ok..., a reduction of blocks..., even using the classical method
> mask55/33/0F. One should take always in account the cache, imho,
> especially by big bulks of datas to be parsed. Also i dont see so super
> performances of that bitslicing method over other ones (example:
> http://wm.ite.pl/articles/sse-popcount.html )
>
> But the bitslicing method could turn to be useful on new processors with
> shared data amongs the cores.

Ref. cache issues:

Originally I also used a simple 8-bit lookup table for bit counting, it
is simple and reasonably efficient.

Using 16-bit lookup tables otoh is probably overkill, since that table
pretty much has to fit in L1 cache, and the input data is assumed to be
pretty much random. (If it isn't random, then a 16 or 32 KB L1 cache can
give very good hit rates indeed, I use that for a very fast word count
program.)

For lookup tables I'd like to try a different split, using an 11-bit
table in three chunks per 32-bit word (11,11,10), since that table will
use maximum 8KB.

(A 16-bit table cannot be compressed into nybbles since all values from
0 to 16 inclusive are possible.)

Terje Mathisen

unread,

Jan 5, 2011, 9:59:48 AM1/5/11

to

hopcode wrote:
> It is ok..., a reduction of blocks..., even using the classical method
> mask55/33/0F. One should take always in account the cache, imho,
> especially by big bulks of datas to be parsed. Also i dont see so super
> performances of that bitslicing method over other ones (example:
> http://wm.ite.pl/articles/sse-popcount.html )
>
> But the bitslicing method could turn to be useful on new processors with
> shared data amongs the cores.

I remembered something and took a look in my old C code directory, where
I found cntbits.c :-)

Turns out I had already written 11 different counting functions, from 8
& 16-bit table lookup to 7 and 15-wide fulladd reductions, in both C and
inline MMX asm.

Except for the 15-wide MMX-asm version which has a bug, they all return
identical results, and the timing varies from 1.10 to 3.14 cycles/byte.

The three top results are 16-bit table lookup and the two 7-wide MMX
implementations, they all take 1.10 to 1.11 cycles/byte.

Using SSE/XMM registers instead of MMX would of course double the width
and should therefore also double the speed, getting close to 0.5
cycles/byte.

James Van Buskirk

unread,

Jan 5, 2011, 1:40:09 PM1/5/11

to

"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message
news:6kdev7-...@ntp6.tmsw.no...

> hopcode wrote:

I started a thread on POPCNT a while back when I discovered the tweak
of using negative logic for the one-bit-wide adder (bitslice) method:

http://groups.google.com/group/comp.lang.asm.x86/browse_frm/thread/958e93d748e48121/8dab77a28ee7f005?hl=en

I was getting something like 6950 clocks/ 32768 bytes or 0.21 clocks
per byte on my Core 2 Duo.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Terje Mathisen

unread,

Jan 5, 2011, 4:33:41 PM1/5/11

to

James Van Buskirk wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"@giganews.com> wrote in message

>> Using SSE/XMM registers instead of MMX would of course double the width
>> and should therefore also double the speed, getting close to 0.5
>> cycles/byte.
>
> I started a thread on POPCNT a while back when I discovered the tweak
> of using negative logic for the one-bit-wide adder (bitslice) method:
>
> http://groups.google.com/group/comp.lang.asm.x86/browse_frm/thread/958e93d748e48121/8dab77a28ee7f005?hl=en
>

Very nice...

> I was getting something like 6950 clocks/ 32768 bytes or 0.21 clocks
> per byte on my Core 2 Duo.

I do remember getting stuck at those 7 ops/fulladd, I never went on from
there to using negative logic.

Good job!

James Van Buskirk

unread,

Jan 5, 2011, 8:11:33 PM1/5/11

to

"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message

news:nm4fv7-...@ntp6.tmsw.no...

> James Van Buskirk wrote:

>> "Terje Mathisen"<"terje.mathisen at tmsw.no"@giganews.com> wrote in
>> message

>> I was getting something like 6950 clocks/ 32768 bytes or 0.21 clocks

>> per byte on my Core 2 Duo.

> I do remember getting stuck at those 7 ops/fulladd, I never went on from
> there to using negative logic.

Originally I had thought that negative logic helps my approach more
than yours, but I just had some second thoughts about that.

Here is my take on the difference between our compression methods,
negative logic aside, with a little background for the O.P.'s
benefit:

If we start with three chunks of bits to count, in xmm0, xmm1, and
xmm2, first we would compute the bitwise logical AND of xmm0 and
xmm1 and also their bitwise logical XOR, then store these back in
xmm1 and xmm0 respectively. Then xmm0 would represent the low
bit of the bitwise arithmetic sum of the two original registers
and xmm1 the high bit. This takes three instructions because you
have to make a temporary copy of one of the original registers.

Repeat with xmm0 and xmm2 so that now xmm0 is the low bit of the
bitwise arithmetic sum of all three registers and and the high
bit will be the sum of what's in xmm1 and xmm2 at this point.
Since there is no possibility of a carry out of the high bit,
we may compute the high bit of the bitwise arithmetic sum of the
three registers as the bitwise logical OR (or even the arithmetic
sum) of xmm1 and xmm2 at this point.

Thus we have turned what originally was a task of counting the bits
in the three registers and adding the results to our running total
into counting the bits in two registers and the seven instructions
outlined above which we may consider as one compression step.
Each compression step we perform on our data replaces one register
counting.

There are two different methods of organizing the compression
steps: the batch method and the continuous-flow method. In the
batch method, for example, one could take 7 level-0 inputs and
in 3 compression steps turn them into one level-0 input and 3
level-1 inputs. Then in 1 compression step we could transform
those 3 level-1 inputs into 1 level-1 input and 1 level-2 input.
Finally we directly count the 3 remaining inputs so it took us
3+1=4 compression steps and 3 direct countings to count up the
bits in 7 registers. With 15 inputs it would be 11 compression
steps and 4 direct countings or in general if the deepest level
created were level-N, then there would be 2**(N+1)-1 inputs,
2**(N+1)-N-2 compression steps and N+1 direct countings.

In the continuous-flow method the algorithm starts out with
a register representing the current sum at each level except
for the first. Then, for example, we could take 8 level-0
inputs and the one we already had and in 4 compression steps
turn them into one level-0 input and 4 level-1 inputs. Then
we could take those 4 level-1 inputs and the one we already
had and turn them into 2 level-2 inputs and one level-1 input
in 2 compression steps. Then finally those 2 level-2 inputs
and the one we already had can be turned in one compression
step into 1 level-2 input and 1 level-3 input. At this point
the 1 level-3 input is directly counted, leaving the 3 lower-
level inputs intact for the next stage. We have in this way
counted the bits in 8 input registers in 4+2+1 compression
steps and 1 direct counting. If the deepest level is level-N
then 2**N inputs, 2**N-1 compression steps, and one direct
counting pre trip through the inner loop.

I think the continuous-flow method is more efficient because
for the same maximum depth it turns more direct countings into
compression steps. The coding overhead you have to invest in
compression is dependent on the depth you take the compression
algorithm to, as is the number of registers required.

Originally I would have thought that negative logic benefited
the continuous-flow method more because in that case there is
no overhead because it costs the same to initialize an xmm
register to all ones as it does to initialize it to all zeros,
whereas for a level-N batch algorithm you have to logically
negate N registers per batch of 2**(N+1)-1 inputs, which takes
N+1 instructions, but it takes one less register: for the N=3
batch algorithm when the 15th input is to be processed there are
3 sum bits, 3 partial carry bits, and the running sum already in
registers so the 15th input makes 8 total registers which is a full
complement in 32-bit mode. Negative logic means we don't need an
extra register for the copy so the algorithm would fit into 8
registers.

Maybe that will inspire you to recode the batch algorithm in
negative logic, Terje!

Terje Mathisen

unread,

Jan 6, 2011, 2:17:11 AM1/6/11

to

When I first started to write bitslice code the main problem was in
fitting the data into the available registers, 6/7/8 is simply too few
to fit a wide/deep version into regs.

At the time I seemed to figure out that the order of doing the full adds
doesn't really matter, you'll still do exactly the same number of them
for a given number of inputs and compression level.

The main benefit being that you only need to actually count the bits
coming out of the top register, instead of all of them, between each
batch of inputs, but this is at least partly negated by the fact that
you can interleave several bitcount processes and get the results almost
as fast as for a single counting.

>
> Originally I would have thought that negative logic benefited
> the continuous-flow method more because in that case there is
> no overhead because it costs the same to initialize an xmm
> register to all ones as it does to initialize it to all zeros,
> whereas for a level-N batch algorithm you have to logically
> negate N registers per batch of 2**(N+1)-1 inputs, which takes
> N+1 instructions, but it takes one less register: for the N=3
> batch algorithm when the 15th input is to be processed there are
> 3 sum bits, 3 partial carry bits, and the running sum already in
> registers so the 15th input makes 8 total registers which is a full
> complement in 32-bit mode. Negative logic means we don't need an
> extra register for the copy so the algorithm would fit into 8
> registers.

Now, THAT is interesting...

>
> Maybe that will inspire you to recode the batch algorithm in
> negative logic, Terje!

Bitcount is fun, but not really critical when POPCOUNT(r64) is
available. :-)

hopcode

unread,

Jan 6, 2011, 2:50:52 AM1/6/11

to

Hi,
yesterday i kept myself amused for a while with Terje method
( http://www.ciphersbyritter.com/NEWS4/BITCT.HTM ) and the "negative
logic" method von James (
http://groups.google.com/group/comp.lang.asm.x86/browse_frm/thread/958e93d748e48121/8dab77a28ee7f005?hl=en
) ...

I think that something is possible, by mixing bitslicing
and somewhat logic, i have it in my head but i am to much
amused to reorganize the whole ;-) . I can say that
whitout a back-standing theory, bitslicing turns to be
ineffective; just like trenspassing the cache boundary by
doing calculations on the compression of the "negative logic"...

Ok,one for all,
let us fix the "minimum required" performances:
- at least 1byte/cycle or
- at least 2byte/cycle
- 16 xmm registers available
- work on 64 / 128 bytes (according to cache capability)

Please,try to neglect the fact that Intel has given out
the POPCNT instruction in SSE 4.2. (i have a final "provokation"
on it...the reason why i started this "handmade" thread)

Cheers,

--

.:hopcode[marc:rainer:kranz]:.

hopcode

unread,

Jan 6, 2011, 3:09:33 AM1/6/11

to

Il 06.01.2011 08:50, hopcode ha scritto:
> Hi,

> I think that something is possible, by mixing bitslicing

> and somewhat logic,....

But, hey, i have forgotten to say that *SIMD-thinking* is already
just like a *bitslicing-thinking*.

Terje Mathisen

unread,

Jan 6, 2011, 4:33:22 AM1/6/11

to

hopcode wrote:
> Hi,
> yesterday i kept myself amused for a while with Terje method
> ( http://www.ciphersbyritter.com/NEWS4/BITCT.HTM ) and the "negative
> logic" method von James (
> http://groups.google.com/group/comp.lang.asm.x86/browse_frm/thread/958e93d748e48121/8dab77a28ee7f005?hl=en
> ) ...
>
> I think that something is possible, by mixing bitslicing
> and somewhat logic, i have it in my head but i am to much
> amused to reorganize the whole ;-) . I can say that
> whitout a back-standing theory, bitslicing turns to be
> ineffective; just like trenspassing the cache boundary by
> doing calculations on the compression of the "negative logic"...
>
> Ok,one for all,
> let us fix the "minimum required" performances:
> - at least 1byte/cycle or
> - at least 2byte/cycle

OK.

> - 16 xmm registers available

64-bit only???

> - work on 64 / 128 bytes (according to cache capability)

That's too little: If you only need to count up to 512/1024 bits, then
you would be much better off to maintain (i.e. cache!) a separate
bitcount at all times, and update it each time you update parts of the data.

Anyway, I'm pretty sure I can beat even POPCOUNT by using the
Altivec-inspired full byte shuffle opcode:

Setup XMM7 register to contain the byte values 0,1,1,2,1,2,2,3,... etc
corresponding to the number of bits in each nybble, and XMM1 with a high
nybble mask (0xf0f0f0f0...), then count like this:

PMOVDQ XMM0,[esi] ;; Input data
PAND XMM0,XMM1 ;; 16 high nybbles
PANDN XMM1,[esi] ;; 16 low nybbles
PSRLW XMM0,4 ;; Move the high into low

At this point we have two registers with 16 nybbles each, right?

PSHUFB XMM1,XMM7
PSHUFB XMM0,XMM7 ;; Nybble bit counts

PADDB XMM6,XMM1
PMOVDQ XMM1,[high_nybble_mask]

PADDB XMM6,XMM0 ;; XMM6 is the loop accumulator

This is 9 instructions to count 16 bytes, and at least some of them will
pair, so it should be less than 9 cycles, right?

The code above can be replicated up to 31 times, giving a maximum bit
count per byte position of 252, before we have to do a horizontal merge
of the individual counts.

Due to the PSHUFB opcode, the code will need SSSE3, i.e. something like
a Core 2 Duo in order to run.

BTW, PSHUFB was almost certainly added to help Apple, since Altivec's
PERMUTE (which is very similar but slightly stronger) had been used a
_lot_ in their PPC era.

hopcode

unread,

Jan 6, 2011, 10:52:00 AM1/6/11

to

Il 06.01.2011 10:33, Terje Mathisen ha scritto:
>
> PADDB XMM6,XMM1
> PMOVDQ XMM1,[high_nybble_mask]
>
> PADDB XMM6,XMM0 ;; XMM6 is the loop accumulator
>
> This is 9 instructions to count 16 bytes, and at least some of them will
> pair, so it should be less than 9 cycles, right?

I think yes (i will check on manuals later)

> The code above can be replicated up to 31 times, giving a maximum bit
> count per byte position of 252, before we have to do a horizontal merge
> of the individual counts.

unfortunately only 31 instead of 32, because i dont know how
to manage and remember overflow, but this is definitevely
a *super "boost" idea* !!!
i think that 16 times "unrolled" before the final merge should give
a super performance *already*.

or... one moment,please, perhaps it is possible using
one normal register to store overflow after 32
cumulative summations. in fact, in the case it overflows,
it overflows for a value no larger than in a 4 bits nibble (15).
Also any RAX/RDX/RCX etc could contain 16 bytes OFlow info.

Now ok:
XMM7 (or else) for bits LUT
XMM6 (or else) accumulator
XMM0 -> XMM4 iterate on data load 8 times (8 * 4 = 32)
...

Hear You later,
Cheers

hopcode

unread,

Jan 6, 2011, 7:07:27 PM1/6/11

to

Hi, here a version (rather overbloated imho).
it satisfies minimum requirements, in this way:
it parse 64kb in ~31000 cycles, it is to say
more than 2 bytes pro cycle !!! it is not all that
good optimization we spoke, i must admit,
but as a starting point ok.
Please note that it reads only 64 byte every loop.

movdqa xmm8,dqword[mask_0F]
movdqa xmm9,dqword[mask_lut]

movdqa xmm0,dqword[rsi]
movdqa xmm1,dqword[rsi+16]
movdqa xmm2,dqword[rsi+32]
movdqa xmm3,dqword[rsi+48]

pxor xmm14,xmm14

movdqa xmm4,xmm0
movdqa xmm5,xmm1
movdqa xmm6,xmm2
movdqa xmm7,xmm3

psrlw xmm4,4
psrlw xmm5,4
psrlw xmm6,4
psrlw xmm7,4

movdqa xmm10,xmm9
movdqa xmm11,xmm9
movdqa xmm12,xmm9
movdqa xmm13,xmm9

pand xmm0,xmm8
pand xmm1,xmm8
pand xmm2,xmm8
pand xmm3,xmm8

pand xmm4,xmm8
pand xmm5,xmm8
pand xmm6,xmm8
pand xmm7,xmm8

pshufb xmm10,xmm0
pshufb xmm11,xmm1
pshufb xmm12,xmm2
pshufb xmm13,xmm3

paddb xmm14,xmm10
paddb xmm14,xmm11
paddb xmm14,xmm12
paddb xmm14,xmm13

movdqa xmm10,xmm9
movdqa xmm11,xmm9
movdqa xmm12,xmm9
movdqa xmm13,xmm9

pshufb xmm10,xmm4
pshufb xmm11,xmm5
pshufb xmm12,xmm6
pshufb xmm13,xmm7

paddb xmm14,xmm10
paddb xmm14,xmm11
paddb xmm14,xmm12
paddb xmm14,xmm13

phaddd xmm14,xmm14
phaddw xmm14,xmm14
movdq2q mm0,xmm14
phaddw mm0,mm0

movd eax,mm0
movzx ecx,ah
movzx eax,al
add eax,ecx

Cheers,

.p.s: my machine: note i have a cache of 8-way set associative, 64-byte
line size !!!

Number of cores 4 (max 4)
Number of threads 4 (max 4)
Name Intel Core 2 Quad Q9300
Codename Yorkfield
Specification Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
Package Socket 775 LGA (platform ID = 4h)
CPUID 6.7.A
Extended CPUID 6.17
Core Stepping
Technology 45 nm
Core Speed 1999.9 MHz (6.0 x 333.3 MHz)
Rated Bus speed 1333.3 MHz
Stock frequency 2500 MHz
Instructions sets MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, EM64T

L1 Data cache 4 x 32 KBytes, 8-way set associative, 64-byte line size

L1 Instruction cache 4 x 32 KBytes, 8-way set associative, 64-byte line size

L2 cache 2 x 2048 KBytes, 8-way set associative, 64-byte line size

FID/VID Control yes
FID range 6.0x - 7.5x
max VID 1.250 V
Features XD

Terje Mathisen

unread,

Jan 7, 2011, 2:25:46 AM1/7/11

to

hopcode wrote:
> Hi, here a version (rather overbloated imho).
> it satisfies minimum requirements, in this way:
> it parse 64kb in ~31000 cycles, it is to say
> more than 2 bytes pro cycle !!! it is not all that
> good optimization we spoke, i must admit,
> but as a starting point ok.
> Please note that it reads only 64 byte every loop.

You should probably get rid of most of the aggregation code, i.e. do at
least four iterations of the inner loop code so that you'll have 256
bytes aggregated (16 bytes or 128 bits per byte position.

At that point you do a single merge of bytes into words (one copy, one
word shift, one mask and one word add, plus an add into a byte-wide
accumulator.

At this point you can do 256 iterations of this outer loop,
corresponding to exactly 64KB, before you have to use a similar set of
five dword instructions in order to move your data into the dword
accumulator.

(Since the dword acc is good for 32 GB of data, it is probably safe to
stop here?)

Terje

PS. I noted from your code that I had gotten the parameter order of the
PSHUFB instruction reversed. A pity, when used as a nybble lookup table
the opposite direction would have saved all the copies. :-(

James Van Buskirk

unread,

Jan 7, 2011, 1:39:54 PM1/7/11

to

"hopcode" <hop...@nospicedham.nullnichtsnada.com> wrote in message
news:ig5lg5$seh$1...@speranza.aioe.org...

> phaddd xmm14,xmm14
> phaddw xmm14,xmm14
> movdq2q mm0,xmm14
> phaddw mm0,mm0

> movd eax,mm0
> movzx ecx,ah
> movzx eax,al
> add eax,ecx

Look up the PSADBW instruction. Apply your newfound knowledge.

Terje Mathisen

unread,

Jan 8, 2011, 3:49:39 AM1/8/11

to

James Van Buskirk wrote:
>
> Look up the PSADBW instruction. Apply your newfound knowledge.
>

Thanks James, I should have remembered that one. :-)

I was amazed when I checked and realized that it has been around since
SSE2, i.e. a long time before PSHUFB which waited until SSSE3.

Anyway, this makes it obvious that the proper way to merge nybble counts
is to add the two sets together, set a target reg to zero and use PSADBW
to generate two (64-bit) sums.

At this point we're at 5 instructions to load and split 16 bytes into 32
nybbles, 4 instr for the two PSHUFBs, one PADDB to merge them and one
PXOR, one PSADBW and one PADDQ to get the data into the two final 64-bit
accumulators: A total of 13 instructions.

movdqa xmm0,[esi]
movdqa xmm1,xmm7 ; High nybble mask
pandn xmm1,xmm0 ; Keep the low nybbles
pand xmm0,xmm7
psrlw xmm0,4

movdqa xmm2,xmm6 ; Nybble bit count copy
movdqa xmm3,xmm6 ; Nybble bit count copy

pshufb xmm2,xmm0
pshufb xmm3,xmm1

paddb xmm2,xmm3

We can as previously unroll (or loop) together 16 iterations of the
first 10 instructions, plus 15 PADDBs, before the PSADBW sequence.

I.e. ~11 SSE instructions per 16 bytes.

hopcode

unread,

Jan 8, 2011, 4:44:49 AM1/8/11

to

Il 07.01.2011 19:39, James Van Buskirk ha scritto:
> "hopcode"<hop...@nospicedham.nullnichtsnada.com> wrote in message
> news:ig5lg5$seh$1...@speranza.aioe.org...
>
>> phaddd xmm14,xmm14
>> phaddw xmm14,xmm14
>> movdq2q mm0,xmm14
>> phaddw mm0,mm0
>
>> movd eax,mm0
>> movzx ecx,ah
>> movzx eax,al
>> add eax,ecx
>
> Look up the PSADBW instruction. Apply your newfound knowledge.
>

Yeap, cool, 3 bytes pro cycle in my tests !!!!

James Van Buskirk

unread,

Jan 8, 2011, 6:23:30 PM1/8/11

to

"hopcode" <hop...@nospicedham.nullnichtsnada.com> wrote in message

news:ig9bmh$tb0$1...@speranza.aioe.org...

> Il 07.01.2011 19:39, James Van Buskirk ha scritto:

>> Look up the PSADBW instruction.

> Yeap, cool, 3 bytes pro cycle in my tests !!!!

One thing I have noticed with OOO processors is that it can be a
good thing to use the OOO capabilities. To do this, you write
instruction streams consecutively rather than concurrently and let
the processor read across instruction streams, renaming registers
and so utilizing the physical register file which is larger than the
visible register file.

_popcnt8:
movdqa %xmm6, 8(%rsp)
movdqa %xmm7, 24(%rsp)
movdqa nonsense(%rip), %xmm1
movdqa LUT(%rip), %xmm3
xorps %xmm6, %xmm6
xorps %xmm7, %xmm7
movl (%rdx), %edx
.align 16
hopcode_1c:
movaps (%rcx), %xmm0
movaps %xmm1, %xmm2
andnps %xmm0, %xmm1
andps %xmm2, %xmm0
movaps %xmm3, %xmm4
psrlw $4, %xmm1
pshufb %xmm0, %xmm3
movaps %xmm4, %xmm5
pshufb %xmm1, %xmm5
paddb %xmm3, %xmm5
movaps 16(%rcx), %xmm0
movaps %xmm2, %xmm1
andnps %xmm0, %xmm2
andps %xmm1, %xmm0
psrlw $4, %xmm2
movaps %xmm4, %xmm3
pshufb %xmm0, %xmm4
paddb %xmm4, %xmm5
movaps %xmm3, %xmm0
pshufb %xmm2, %xmm0
paddb %xmm0, %xmm5
movaps 32(%rcx), %xmm0
movaps %xmm1, %xmm2
andnps %xmm0, %xmm1
andps %xmm2, %xmm0
psrlw $4, %xmm1
movaps %xmm3, %xmm4
pshufb %xmm0, %xmm3
paddb %xmm3, %xmm5
movaps %xmm4, %xmm0
pshufb %xmm1, %xmm0
paddb %xmm0, %xmm5
movaps 48(%rcx), %xmm0
movaps %xmm2, %xmm1
andnps %xmm0, %xmm2
andps %xmm1, %xmm0
psrlw $4, %xmm2
movaps %xmm4, %xmm3
pshufb %xmm0, %xmm4
paddb %xmm4, %xmm5
movaps %xmm3, %xmm0
pshufb %xmm2, %xmm0
paddb %xmm0, %xmm5
psadbw %xmm6, %xmm5
paddd %xmm5, %xmm7
addq $64, %rcx
subl $64, %edx
jg hopcode_1c
movhlps %xmm7, %xmm6
paddd %xmm6, %xmm7
movq %xmm7, %rax
movaps 8(%rsp), %xmm6
movaps 24(%rsp), %xmm7
ret

In the above, nonsense(%rip) is the low nybble mask and LUT(%rip) is
the 4-bit LUT. Also note the rotation of constants between registers
xmm1 and xmm2 and also between xmm3 and xmm4. This is intended to
avoid leaving dead registers. It timed at about 13400 clocks for
32768 bytes vs. 14200 clocks for the concurrent version. Note that
this is significantly slower than your processor because PSHUFB
takes 4 uops on my Core 2 Duo E6700 compared to 1 uop on your CPU.
See if this technique helps on your processor.

hopcode

unread,

Jan 30, 2011, 6:15:04 AM1/30/11

to

Hi, All ! Does this post reach the newsgroup ? (i do hope yes)
Ok, After reading "Fast bit reverse" thread on alt.lang.asm
i think the best way to count bits is by using PSHUFB

Il 09.01.2011 00:23, James Van Buskirk ha scritto:
> One thing I have noticed with OOO processors is that it can be a
> good thing to use the OOO capabilities. To do this, you write
> instruction streams consecutively rather than concurrently and let
> the processor read across instruction streams, renaming registers
> and so utilizing the physical register file which is larger than the
> visible register file.
>

> consecutively rather than concurrently

Hi, James
thanks for Your precious help

OoO + SIMD sounds to me somehow magical/obscure yet.
I know it (i scent it ) but i cannot bring evidence to
concurrence/renaming etc.

> movaps 8(%rsp), %xmm6
> movaps 24(%rsp), %xmm7

What does it mean ? perhaps
movaps xmm6,[rsp+8]
movaps xmm7,[rsp+24]

> ... This is intended to

> avoid leaving dead registers. It timed at about 13400 clocks for
> 32768 bytes vs. 14200 clocks for the concurrent version. Note that
> this is significantly slower than your processor because PSHUFB
> takes 4 uops on my Core 2 Duo E6700 compared to 1 uop on your CPU.
> See if this technique helps on your processor.
>

PSHUFB XMM,XMM letency 4,throughput 3 times 1.70 "pairable" clock cycles
on my Core2 Quad. 2 consecutive PSHUFB instructions times just the same !!!

Now, executing 300 times my POPCNT on 8 megabyte file gives
me *11,2 bytes/clock*, /the best/ absolute performance on

http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html

of about 3 up to 8 byte/clock (according to different processors and
flags) of those methods (there you will find the bitslice method too)

Cheers,

hopcode

unread,

Jan 30, 2011, 6:32:34 AM1/30/11

to

Il 30.01.2011 12:15, hopcode ha scritto:
>
> Now, executing 300 times my POPCNT on 8 megabyte file gives
> me *11,2 bytes/clock*, /the best/ absolute performance on
>

Forgotten to say that my new version eats 128 bytes for each
internal loop. That is an unrolled version, using only XMM1->XMM6
registers.

/for the record/ Using *prefetchnt* on aligned datas,
it does not improve much.

Robert Redelmeier

unread,

Jan 30, 2011, 11:51:03 AM1/30/11

to

hopcode <hop...@nospicedham.nullnichtsnada.com> wrote in part:

> Il 30.01.2011 12:15, hopcode ha scritto:
>> Now, executing 300 times my POPCNT on 8 megabyte file gives
>> me *11,2 bytes/clock*, /the best/ absolute performance on
>>
> Forgotten to say that my new version eats 128 bytes for
> each internal loop. That is an unrolled version, using only
> XMM1->XMM6 registers.
>
> /for the record/ Using *prefetchnt* on aligned datas,
> it does not improve much.

With what stride? The distance (bytes, often hundreds) of the
read-ahead preload has a large impact on its' effectiveness,
_when_ prefetch can help at all.

PREFETCH is best when the [complex] calcs and memory reads
take roughly the same time, and it can never save more than
50% of time by overlapping the two. So it gets a bad rep.

There are two extreme cases: the [common] memory/bandwidth-limited,
where the calcs (however complex looking) only take a few cycles
and where prefetch can only save those few cycles; and the [rare]
computation-limited where saving a few hundred clocks is tiny
compared to the FP/ALU ops.

A useful metric is to compare machine memory ops with total runtime.
Time your pgm, then comment out all the work except the read/writes
(MOV,r/m) and time again. This runs through all the mem accesses in
however regular or irregular your pattern and gives you a bandwidth
measure for this algorithm. It might be far different from your
memory "nameplate" or benchmark results. If so, optimization
should focus on changing the cache fit or order of accesses.

Modern machines run ~3 GHz per core with memory bandwidth ~2 GB/s.
That gives _at_least_ 1.5 clocks per byte or 24 clocks per slab128.
POPCNT is a little unusual, a simple readstream unless you are
storing those counts in a second vector. Then you'd better work on
keeping the writes from breaking (and slowing) the reads too often.
You can save the read-before-write with MOVNTQ, but I don't know
if bus burstlength can be influenced (SFENCE?)

-- Robert

James Van Buskirk

unread,

Jan 30, 2011, 2:12:49 PM1/30/11

to

"hopcode" <hop...@nospicedham.nullnichtsnada.com> wrote in message

news:ii3h7s$jiu$1...@speranza.aioe.org...

> Hi, All ! Does this post reach the newsgroup ? (i do hope yes)
> Ok, After reading "Fast bit reverse" thread on alt.lang.asm
> i think the best way to count bits is by using PSHUFB

> Il 09.01.2011 00:23, James Van Buskirk ha scritto:
>> One thing I have noticed with OOO processors is that it can be a
>> good thing to use the OOO capabilities. To do this, you write
>> instruction streams consecutively rather than concurrently and let
>> the processor read across instruction streams, renaming registers
>> and so utilizing the physical register file which is larger than the
>> visible register file.

> > consecutively rather than concurrently

> > movaps 8(%rsp), %xmm6
> > movaps 24(%rsp), %xmm7

> What does it mean ? perhaps
> movaps xmm6,[rsp+8]
> movaps xmm7,[rsp+24]

Yes. I was using as.exe, the GNU assembler, but you were able
to decode it.

> PSHUFB XMM,XMM letency 4,throughput 3 times 1.70 "pairable" clock cycles
> on my Core2 Quad. 2 consecutive PSHUFB instructions times just the same
> !!!

> Now, executing 300 times my POPCNT on 8 megabyte file gives
> me *11,2 bytes/clock*, /the best/ absolute performance on

If that means 11.2 bytes per clock, I am sceptical.

(16 bytes)/(xmm register)*(1 clock)/(11.2 bytes) = 1.43 clocks/(xmm
register)

The code we discussed required 1+1+1+2+2+2+1+1+1 = 12 instructions to
load a register, duplicate it, shift one of the copies, mask out the
high bits in both copies, load 2 LUT registers, perform two lookups,
add the bytewise bitcounts, sum into qwords, and add to running totals.
Your code would have to be sustaining
(12 instructions)/(1.43 clocks) = 8.4 instructions per clock.
I tried looking in your web page below, but couldn't find your code,
but I recommend you reexamine your timing technique.

> --
>
> .:hopcode[marc:rainer:kranz]:.
> x64 Assembly Lab
> http://sites.google.com/site/x64lab

Also in your web page it says that WindowFromPoint passes its
argument by reference on 32-bit Windows and by value on 64-bit
Windows. This is incorrect: WindowFromPoint passes by value
on both platforms, just check MSDN. I remember tripping over
this just recently with the MonitorFromPoint function: my
program worked in 64-bit Windows but failed to link in 32-bit
Windows because the name mangling gfortran generated from my
interface specification was _MonitorFromPoint@8 but because
argument pt is passed by value the mangled name in 32-bit Windows
is actually _MonitorFromPoint@12 . Fixing the interface to
specify passing by value made everything work, even on the rather
more difficult 32-bit Windows platform.

hopcode

unread,

Jan 30, 2011, 5:45:27 PM1/30/11

to

Il 30.01.2011 17:51, Robert Redelmeier ha scritto:
> hopcode<hop...@nospicedham.nullnichtsnada.com> wrote in part:
>> Il 30.01.2011 12:15, hopcode ha scritto:
>>> Now, executing 300 times my POPCNT on 8 megabyte file gives
>>> me *11,2 bytes/clock*, /the best/ absolute performance on
>>>
>> Forgotten to say that my new version eats 128 bytes for
>> each internal loop. That is an unrolled version, using only
>> XMM1->XMM6 registers.
>>
>> /for the record/ Using *prefetchnt* on aligned datas,
>> it does not improve much.
>
>
> With what stride? The distance (bytes, often hundreds) of the
> read-ahead preload has a large impact on its' effectiveness,
> _when_ prefetch can help at all.

Many and different strides. From 256 to 32kb. Best improvements
on *64kb data* are ~10% that results in 4 bytes/clock.

> PREFETCH is best when the [complex] calcs and memory reads
> take roughly the same time, and it can never save more than
> 50% of time by overlapping the two. So it gets a bad rep.

... and my tests confirm exactly what in Your statement.

> There are two extreme cases: the [common] memory/bandwidth-limited,
> where the calcs (however complex looking) only take a few cycles
> and where prefetch can only save those few cycles; and the [rare]
> computation-limited where saving a few hundred clocks is tiny
> compared to the FP/ALU ops.

i agree with You 100%.
PREFETCH is a quite new instruction to me; looking at the the first code
there's no apparent need to use PREFETCH!! Then i thought: why one
should avoid using it, even if instruction-execution-time doesnt overlap
memory-read-time ?

hopcode

unread,

Jan 30, 2011, 6:18:07 PM1/30/11

to

Il 30.01.2011 20:12, James Van Buskirk ha scritto:
>
> If that means 11.2 bytes per clock, I am sceptical.
>
> (16 bytes)/(xmm register)*(1 clock)/(11.2 bytes) = 1.43 clocks/(xmm
> register)
>

That is! RDTSC on 8mb data, average EDX=0 EAX=700000-900000
I am sceptical too yet. But the recipe (my own personal) i use to time
is 100% guaranteed working. On 64kb data and using PREFETCH,
improvements reach the 4 bytes/cycle. Now, why, i ask myself, such
remarkable improvements, 11.2 bytes/cycle, on 8 megabyte data ?

Note that here

http://dalkescientific.com/writings/diary/popcnt.c

8 megs of junk will be created *only once*; they are *already*
in the cache on subsequent test-routines.

I cannot check Your routine, but i think if You test it under
those conditions (300 iteration on 8megabyte data)
i guess it could touch the 12-14 bytes/cycle !!!

...or... any idea ?

>-----

> Also in your web page it says that WindowFromPoint passes its
> argument by reference on 32-bit Windows and by value on 64-bit
> Windows. This is incorrect: WindowFromPoint passes by value
> on both platforms, just check MSDN.
>

Right,i checked it just now and the right 32bit proto of
WindowFromPoint is

push [pt.y] ;dword
push [pt.x] ;dword
call [WindowFromPoint]

it's so long time that i abandoned 32bit Winapi that i
forgot almost all :-) I will fix it on website. Thanks for
the note. What about MSN docs ?

Cheers,

hopcode

unread,

Jan 31, 2011, 8:40:36 AM1/31/11

to

Il 31.01.2011 00:18, hopcode ha scritto:
> Il 30.01.2011 20:12, James Van Buskirk ha scritto:
>>
>> If that means 11.2 bytes per clock, I am sceptical.
>>
>> (16 bytes)/(xmm register)*(1 clock)/(11.2 bytes) = 1.43 clocks/(xmm
>> register)

ehm... i am very sorry,
i took the stick from the *wrong* *wrong* end.
it didnt see a damn digit typo in the external loop!
it seemed so strange...

correct tests confirm exactly a sloppy 1-2 bytes/clock

Should i been banned for this ?

:-)

hopcode

unread,

Feb 17, 2011, 11:52:13 PM2/17/11

to

Hi All,
thank for your help. After reading carefully
all your posts i am almost sure that PSHUFB
wins 2x on other good and nice solutions.
I am doing some more experience on using
the PREFETCH instruction, timing till up to
30% on my previous code.
Anyway i cannot break the barrier of 4byte/cycle.
I would reach the 6byte/cycle !!! But i think
it is impossible by using the PSHUFB.

There is something unclear to me yet about
the use of the cache.Anyway i do hold the
following timings and tests as mostly unreliable,

http://dalkescientific.com/writings/diary/popcnt.c

they are coherent timings in the whole,ok, /per se/
one related to another, but obviously unreal, because of
the extra overhead of C and CRT functions compilation/stub.

So "unreal" as real and concrete is the non-serializing RDTSC !
Bet you that the gettimeofday() has the RDTSC in its
implementation ? ;-)

But i had that bravery to test them again: GCC and other
overbloating flags! i did it...
BTW: the 64kb "brute force" LUT16 is faster than the BITSLICE24,
on my Quad machine !!- isn't it incredible ?

So i would like to test your assembly routines
directly. The Terje's BITSLICE(24) and
a definitive James's CSA version. If there is
a place where to download the source code from,
please tell me.

wolfgang kern

unread,

Feb 18, 2011, 2:04:03 PM2/18/11

to

hopcode wrote:

> thank for your help. After reading carefully
> all your posts i am almost sure that PSHUFB
> wins 2x on other good and nice solutions.
> I am doing some more experience on using
> the PREFETCH instruction, timing till up to
> 30% on my previous code.
> Anyway i cannot break the barrier of 4byte/cycle.
> I would reach the 6byte/cycle !!! But i think
> it is impossible by using the PSHUFB.

For memory access the limit is just the BUS-frequency...
Prefetch does only make sense for 'free' cache-lines to become
cached ahead (this takes time anyway), so only effective if
used in front of a loop which access the precashed area often.

> There is something unclear to me yet about

> the use of the cache. Anyway i do hold the

> following timings and tests as mostly unreliable,

> http://dalkescientific.com/writings/diary/popcnt.c

Yeah, this figures doesn't match my tools ....
POPCNT-instruction seem to do it quite faster than a discrete way.

> they are coherent timings in the whole,ok, /per se/
> one related to another, but obviously unreal, because of
> the extra overhead of C and CRT functions compilation/stub.

It's hard to tell which part of your code and MEM access become
top priority under a certain environment.

> So "unreal" as real and concrete is the non-serializing RDTSC !
> Bet you that the gettimeofday() has the RDTSC in its
> implementation ? ;-)

RDTSC itself isn't serialising, so it's usually used with an
dummy CPUID ahead (but only for the first RDTSC of course).

> But i had that bravery to test them again: GCC and other
> overbloating flags! i did it...
> BTW: the 64kb "brute force" LUT16 is faster than the BITSLICE24,
> on my Quad machine !!- isn't it incredible ?

LUTs, while small enough to reside in cache are for sure the fastest
solution.

> So i would like to test your assembly routines
> directly. The Terje's BITSLICE(24) and
> a definitive James's CSA version. If there is
> a place where to download the source code from,
> please tell me.

Yeah, both examples above show how it could be done , but ...
Not meant offending at all: why don't you use your brain to
find the best solution for your own given task :)
We ASMers can make raw calculations on timing of our code by
using the latency/throughput info from the CPU-manuals and
adding some ideas of where caching takes place...

And I'd recommend: never try any HLL for timing benchmarks!
__
wolfgang

hopcode

unread,

Feb 18, 2011, 8:04:32 PM2/18/11

to

Il 18.02.2011 20:04, wolfgang kern ha scritto:
>
> It's hard to tell which part of your code and MEM access become
> top priority under a certain environment.

Hi wolfgang,
that is right in most cases, but i wonder whether other OS have
something similiar to the relatively accurate
QueryPerformanceFrequency/QueryPerformanceCounter duo.

>
> Yeah, both examples above show how it could be done , but ...
> Not meant offending at all: why don't you use your brain to
> find the best solution for your own given task :)

ehm... i am a bit lazy. cut'n'paste from disassembled code
is too much too. :-) anyway i asked for a new improved
assembly coded version, if any, since the well known ones
of some times ago.

> We ASMers can make raw calculations on timing of our code by
> using the latency/throughput info from the CPU-manuals and
> adding some ideas of where caching takes place...

i am learning now doing empirical latency/throughput
calculations on one hand,as i have seen here on CLAX
from someone among you.

> And I'd recommend: never try any HLL for timing benchmarks!

that's sure!

Now, here my timings using PSHUFB in *pure* asm environmnent
using a RDTSC personal recipe i customized from several different
opinions and tests.

Old version code:

MEGABYTE TIMING
a) 8.38 5.49 > 1 byte / clock
b) 4.19 2.73 ~ 2 byte / clock
c) 2.09 0.74 ~ 3 byte / clock
d) 1.04 0.32 > 3 byte / clock

New version code, using PREFETCH

MEGABYTE TIMING
a) 8.38 3.98 > 2 byte / clock ~30% faster
b) 4.19 2.01 > 2 byte / clock ~30% faster
c) 2.09 0.73 <~ 3 byte / clock -- no gain !!! --
d) 1.04 0.34 >= 3 byte / clock -- no gain !!! --

Considering i mount
Quad Core
Bus Speed 333.3 Mhz
all cache 64 byte 8-way/set associative
L1-D 4x32, L2-I 4x32 ,L2 2x2048

i can reach 4 byte/cycle using some different strides.
But i am working on the prefetching theory yet.

wolfgang kern

unread,

Feb 19, 2011, 6:43:36 AM2/19/11

to

"hopcode" wrote:

>> It's hard to tell which part of your code and MEM access become
>> top priority under a certain environment.

> Hi wolfgang,
Hello,

> that is right in most cases, but i wonder whether other OS have something
> similiar to the relatively accurate
> QueryPerformanceFrequency/QueryPerformanceCounter duo.

These performance counters are part of the CPU and seem to work
on a quite coarse scale

>> Yeah, both examples above show how it could be done , but ...
>> Not meant offending at all: why don't you use your brain to
>> find the best solution for your own given task :)

> ehm... i am a bit lazy. cut'n'paste from disassembled code
> is too much too. :-) anyway i asked for a new improved
> assembly coded version, if any, since the well known ones
> of some times ago.

:)

It often needs reorganisation of structs and arrays to find
any gain with prefetch.
You could check the machine specific timing of prefetch, ie:

CPUID
RDTSC
mov ebx,eax
mov ecx,edx
prefetch [esi]
RDTSC
sub eax,ebx
sbb edx,ecx
int 3 ;watch edx:eax in a debugger yet

against a variant which got NOPs instead of the prefetch,
but I wouldn't run this multiple times in a loop, just once.

__
wolfgang

Bob Masta

unread,

Feb 19, 2011, 8:04:01 AM2/19/11

to

On Sat, 19 Feb 2011 12:43:36 +0100, "wolfgang kern"
<now...@never.at> wrote:

>"hopcode" wrote:

>> Hi wolfgang,
>Hello,
>> that is right in most cases, but i wonder whether other OS have something
>> similiar to the relatively accurate
>> QueryPerformanceFrequency/QueryPerformanceCounter duo.
>
>These performance counters are part of the CPU and seem to work
>on a quite coarse scale
>

I seem to recall finding that QueryPerfomanceCounter
returns results similar to RDTSC on XP and later. On Win9x
it uses Ye Olde Tyme system timer at 55 msec.
But QueryPerfomanceFrequency tells you what timer speed is
used on your particular system.

Best regards,

Bob Masta

DAQARTA v6.00
Data AcQuisition And Real-Time Analysis
www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
Frequency Counter, FREE Signal Generator
Pitch Track, Pitch-to-MIDI
Science with your sound card!

Rick C. Hodgin

unread,

Feb 19, 2011, 7:36:14 PM2/19/11

to

On Saturday, February 19, 2011 7:04:01 AM UTC-6, Bob Masta wrote:
> I seem to recall finding that QueryPerformanceCounter

> returns results similar to RDTSC on XP and later.

I just stepped through the assembly code generated in Windows XP and Visual Studio 2008. The function makes a call into the OS with SYSENTER, and returns with the values in registers. It also goes through several indirect layers of calling to get there.

- Rick C. Hodgin

wolfgang kern

unread,

Feb 20, 2011, 2:37:41 AM2/20/11

to

Bob Masta mentioned:
...

>>> that is right in most cases, but i wonder whether other OS have
>>> something
>>> similiar to the relatively accurate
>>> QueryPerformanceFrequency/QueryPerformanceCounter duo.

>>These performance counters are part of the CPU and seem to work
>>on a quite coarse scale

> I seem to recall finding that QueryPerfomanceCounter
> returns results similar to RDTSC on XP and later. On Win9x
> it uses Ye Olde Tyme system timer at 55 msec.
> But QueryPerfomanceFrequency tells you what timer speed is
> used on your particular system.

I see, so this windoze-functions may not mean the CPU-internal
"Performance Counters".
A while back (INC vs ADD thread) we saw benchmark values reported to
be 4..15 cycles (instead of 0..1) which may come from such functions :)
__
wolfgang

Bob Masta

unread,

Feb 20, 2011, 9:35:03 AM2/20/11

to

Oops! My internal "recall" chipset needs to be recalled!
OK, so I ran a few tests of QueryPerformanceFrequency:

1,193,180 Hz on 200 MHz Win95
3,579,545 Hz on 1.6 GHz XP
1,363,828 Hz on ??? GHz Win7

(I don't know how to find the CPU clock frequency on Win7...
seems the good folks in Redmond decided that little detail
was no longer worth including in Control Panel... or at
least they have it well hidden!)

Anyway, the Win95 PerformanceFrequency is the original PC
system timer clock frequency, which when 65536 ticks were
counted up in hardware gave the original 55 msec (18.2 Hz)
Time-of-Day clock interrupt.

The XP frequency is 3 times that, which is the TV color
burst frequency, which was supplied on the old ISA bus for
use by the original CGA graphics cards.

The old PC/XT systems used a single master clock at 4 times
the color burst, or 14,318,180 Hz, and divided it by 3 to
get the 4.77 MHz CPU clock, by 4 to get the CGA signal, and
by 12 to get the system timer clock. This is the reason
for the strange 4.77 MHz value... they wanted to derive it
from the master clock (crystal oscillators are expensive),
and they wanted to keep under the 5 MHz limit of the 8088
and peripheral chips of the day.

So it seems the legacy stuff is still in there, or was at
least back when my XP system was new. (No idea what's going
on with Win7.) I think this indicates that the
PerformanceFrequency timer is in a separate chip, which
isn't affected when the CPU clock is reduced for power
management. (Which *would* affect the RDTSC values.) At
least, on the XP system, QueryPerformanceFrequency returned
the same value when the CPU Speed setting was at Max or
Slow. It would make sense that anything used by the OS for
timing shouldn't be affected by the CPU speed.

Sorry for my memory lapse!

Tim Roberts

unread,

Feb 21, 2011, 12:31:41 AM2/21/11

to

"wolfgang kern" <now...@never.at> wrote:
>
>Bob Masta mentioned:

>> But QueryPerfomanceFrequency tells you what timer speed is
>> used on your particular system.
>
>I see, so this windoze-functions may not mean the CPU-internal
>"Performance Counters".

No. Depending on the version of the operating system and the capabilities
of your motherboard, it might be the motherboard countdown timer, the CPU
cycle counter, or the APIC timer.
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

wolfgang kern

unread,

Feb 21, 2011, 10:41:15 AM2/21/11

to

Bob Masta wrote:
...

>>>>These performance counters are part of the CPU and seem to work
>>>>on a quite coarse scale

>>> I seem to recall finding that QueryPerfomanceCounter
>>> returns results similar to RDTSC on XP and later. On Win9x
>>> it uses Ye Olde Tyme system timer at 55 msec.
>>> But QueryPerfomanceFrequency tells you what timer speed is
>>> used on your particular system.

>>I see, so this windoze-functions may not mean the CPU-internal
>>"Performance Counters".
>>A while back (INC vs ADD thread) we saw benchmark values reported to
>>be 4..15 cycles (instead of 0..1) which may come from such functions :)

> Oops! My internal "recall" chipset needs to be recalled!

> OK, so I ran a few tests of QueryPerformanceFrequency:

> 1,193,180 Hz on 200 MHz Win95
> 3,579,545 Hz on 1.6 GHz XP
> 1,363,828 Hz on ??? GHz Win7

> (I don't know how to find the CPU clock frequency on Win7...
> seems the good folks in Redmond decided that little detail
> was no longer worth including in Control Panel... or at
> least they have it well hidden!)

> Anyway, the Win95 PerformanceFrequency is the original PC
> system timer clock frequency, which when 65536 ticks were
> counted up in hardware gave the original 55 msec (18.2 Hz)
> Time-of-Day clock interrupt.

I use the RTCL IRQ (every Second) for Time of Day and good
old PIC for timeslices (1mS) for now.

> The XP frequency is 3 times that, which is the TV color
> burst frequency, which was supplied on the old ISA bus for
> use by the original CGA graphics cards.

> The old PC/XT systems used a single master clock at 4 times
> the color burst, or 14,318,180 Hz, and divided it by 3 to
> get the 4.77 MHz CPU clock, by 4 to get the CGA signal, and
> by 12 to get the system timer clock. This is the reason
> for the strange 4.77 MHz value... they wanted to derive it
> from the master clock (crystal oscillators are expensive),
> and they wanted to keep under the 5 MHz limit of the 8088
> and peripheral chips of the day.

Yeah I remember how hard it once was to get such an X-tal here,
because we had PAL-TV with color burstrate at 4.43 MHz.

> So it seems the legacy stuff is still in there, or was at
> least back when my XP system was new. (No idea what's going
> on with Win7.) I think this indicates that the
> PerformanceFrequency timer is in a separate chip, which
> isn't affected when the CPU clock is reduced for power
> management. (Which *would* affect the RDTSC values.) At
> least, on the XP system, QueryPerformanceFrequency returned
> the same value when the CPU Speed setting was at Max or
> Slow. It would make sense that anything used by the OS for
> timing shouldn't be affected by the CPU speed.

My AMD-CPU has TSC independent (at maximum) of CPU powerstates.
a set bit show this feature: CPUID 8000_0007(b8[edx]).

> Sorry for my memory lapse!

Nothing wrong :) I can only guess what windoze does anyway,
Tim cleared it up for us.

__
wolfgang

wolfgang kern

unread,

Feb 21, 2011, 9:59:53 AM2/21/11

to

Tim Roberts replied:

>>> But QueryPerfomanceFrequency tells you what timer speed is
>>> used on your particular system.

>>I see, so this windoze-functions may not mean the CPU-internal
>>"Performance Counters".

> No. Depending on the version of the operating system and the capabilities
> of your motherboard, it might be the motherboard countdown timer, the CPU
> cycle counter, or the APIC timer.

Thanks, me too think about to use other hardware opportunities than
just the PIT (set to ~1mS) for my timeslice driven tasks in future.

__
wolfgang

hopcode

unread,

Apr 1, 2011, 9:52:07 AM4/1/11

to

Ok, in order to refresh our knowledge i will post here
2 precious links, and my original (and provocative) question:

"...but i wonder whether other OS have something similiar to the
relatively accurate QueryPerformanceFrequency/QueryPerformanceCounter
duo. "

link 1 ) http://msdn.microsoft.com/en-us/windows/hardware/gg463347
link 2)
file:///C:/Users/marc/Documents/asm/gui/measure-code-sections-using-the-enhanced-timer.htm

Just now i have found a way to eliminate the MOVDQA-cycle-fresser
instruction in the loop to break the 4byte/cycle barrier. i think PSHUF
wins theorethically almost everywhere. i will test it in the following
days. Stay tuned! :-)

Robert Redelmeier

unread,

Apr 1, 2011, 5:17:39 PM4/1/11

to

hopcode <hop...@nullnichtsnada.com> wrote in part:

> Ok, in order to refresh our knowledge i will post here 2
> precious links, and my original (and provocative) question:
>
> "...but i wonder whether other OS have
> something similiar to the relatively accurate
> QueryPerformanceFrequency/QueryPerformanceCounter duo. "

I would normally think RDTSC is _by_far_ the most accurate and
preferable performance measure for cycle counts 20-2,000,000 .
Shorter cycle counts are hard to do by any means (most often
done by repetition and dividing) and longer counts get into
interrupt and task-switching issues.

AFAIK, most OSes other than some old versions of MS-Windows
allow the RDTSC instruction and it saves the ring0 transition.

To get an idea of CPU clock and TSC offsets, try running
1000+ RDTSC / sleep(1) / RDTSC sets and carefully check
the distribution of results. You can also run sleep(0)s.

-- Robert

Frank Kotler

unread,

Apr 1, 2011, 10:08:49 PM4/1/11

to

hopcode wrote:

...

>> 1,363,828 Hz on ??? GHz Win7
>>
>> (I don't know how to find the CPU clock frequency on Win7...

There's a cpuid option that will return a more extensive string than
"genuine intel" or whatever. It includes the speed the CPU is capable of
- not necessarily the speed it's actually running. This crude example is
for Linux, but will give the idea... No idea if it's useful to you or
not. (at least it's "assembly language"!)

Best,
Frank

;--------------------
global _start

section .bss
namestring resb 48

section .text
_start:

mov eax, 80000000h
cpuid
cmp eax, 80000004h
jb exit

mov edi, namestring

mov eax, 80000002h
cpuid
call savestring

mov eax, 80000003h
cpuid
call savestring

mov eax, 80000004h
cpuid
call savestring

mov ecx, namestring
mov edx, 48
mov ebx, 1
mov eax, 4
int 80h

exit:
mov eax, 1
int 80h

savestring:
stosd
mov eax, ebx
stosd
mov eax, ecx
stosd
mov eax, edx
stosd
ret
;----------------