SSE Converting DQWORD to padded string

hopcode

Apr 22, 2011, 11:04:58 PM

Hi All,
I want to share this snippet, an SSE conversion of a DQWORD to a string.
With RDI pointing to a 16-byte-aligned destination buffer of at least
32 bytes and the value to convert in AL/AX/EAX/RAX, it
returns a zero-padded (on the left) string at RDI:

; mask_0F dq 0F0F0F0F'0F0F0F0Fh,0F0F0F0F'0F0F0F0Fh
; mask_str dq 6766656463626160h,7675'7473'72716968h
; mask_30 dq 3030303030303030h,3030303030303030h

;--- usage ----------------

mov rdi,dest
mov rax,0123456789ABCDEFh
call .dq2a
ret

;--- proc
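.dq2a:
movdqa xmm2,dqword [mask_str] ; lookup table for the low nibbles
movdqa xmm3,xmm2 ; second copy for the high nibbles
movdqa xmm4,dqword[mask_0F]
movdqa xmm5,dqword[mask_30]
bswap rax ; most significant byte first
mov qword[rdi],rax
pxor xmm0,xmm0 ; clear stale bytes, otherwise they leak into the lookup
punpcklbw xmm0,dqword[rdi] ; value bytes go to the odd lanes
pand xmm0,xmm4 ; low nibbles, even lanes stay 0
punpcklbw xmm1,dqword[rdi]
psrlw xmm1,4
pand xmm1,xmm4
psrlw xmm1,8 ; high nibbles in the even lanes, odd lanes 0
pshufb xmm2,xmm0 ; nibble -> biased ASCII
pshufb xmm3,xmm1
por xmm2,xmm3 ; merge; filler lanes hold 60h, which changes nothing
psubb xmm2,xmm5 ; remove the 30h bias
movdqa dqword[rdi],xmm2
ret 0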
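
A scalar C model may make the table trick in the routine above easier to follow. It is only a sketch, not the SSE code itself: the name dq2a_model is made up here, and it assumes the unused byte lanes are zero before the lookup. The table entries are the hex digits biased by 30h, so OR-ing in the filler entry 60h leaves every looked-up character unchanged, and subtracting 30h at the end yields plain ASCII.

```c
#include <stdint.h>

/* mask_str from the post: '0'..'9','A'..'F', each biased by +0x30. */
static const uint8_t mask_str[16] = {
    0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67,
    0x68, 0x69, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76
};

/* Hypothetical scalar model: writes 16 hex digits plus a terminator. */
void dq2a_model(uint64_t value, char out[17])
{
    for (int i = 0; i < 8; i++) {
        /* BSWAP: walk the bytes most significant first */
        uint8_t byte = (uint8_t)(value >> (56 - 8 * i));

        /* One 16-bit lane pair as the two PSHUFBs see it:
           the lane that is not in use carries index 0. */
        uint8_t lo_even = mask_str[0];
        uint8_t lo_odd  = mask_str[byte & 0x0F];
        uint8_t hi_even = mask_str[byte >> 4];
        uint8_t hi_odd  = mask_str[0];

        /* POR merges both lookups, PSUBB removes the 0x30 bias */
        out[2 * i]     = (char)((lo_even | hi_even) - 0x30);
        out[2 * i + 1] = (char)((lo_odd  | hi_odd)  - 0x30);
    }
    out[16] = '\0';
}
```

Called with 0123456789ABCDEFh this model fills the buffer with the same sixteen digits, high digit first.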

Tell me what you think about it.

Cheers,

--

.:hopcode[marc:rainer:kranz]:.
x64 Assembly Lab
http://sites.google.com/site/x64lab

hopcode

Apr 23, 2011, 7:04:04 PM

A new, enhanced version of this snippet is on the
x64lab Google group at

https://groups.google.com/group/x64lab/browse_thread/thread/b72125ac10cc01e4?hl=en

On 23.04.2011 05:04, hopcode wrote:
> Hi All,
> want to share this snippet, SSE Converting DQWORD to string;
> having in RDI a 16 aligned destination buffer of at least
> 32 bytes, in AL/AX/EAX/RAX a value to convert to string,
> return in RDI a zero-padded string (on the left):
>
> ; mask_0F dq 0F0F0F0F'0F0F0F0Fh,0F0F0F0F'0F0F0F0Fh
> ; mask_str dq 6766656463626160h,7675'7473'72716968h
> ; mask_30 dq 3030303030303030h,3030303030303030h
>
> ;--- usage ----------------
>
> mov rdi,dest
> mov rax,0123456789ABCDEFh
> call .dq2a
> ret
>
> ;--- proc

> .dq2a:

Bernhard Schornak

Apr 24, 2011, 4:59:59 AM

hopcode wrote:

> A new enhanced version of this snippet on
> x64lab Google group at
>
> https://groups.google.com/group/x64lab/browse_thread/thread/b72125ac10cc01e4?hl=en

Another way, throwing properly formatted
hexadecimal and decimal numbers (old and
outdated version - will be replaced with
current code sooner or later):

http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/cvt.S

All functions are compatible with Win-64
calling conventions. Used registers (ex-
cept RAX, XMM0...XMM3) are restored.

Greetings from Augsburg

Bernhard Schornak

wolfgang kern

Apr 24, 2011, 7:32:28 AM

Bernhard Schornak wrote:

https://groups.google.com/group/x64lab/browse_thread/thread/b72125ac10cc01e4?hl=en

> http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/cvt.S

After viewing this bloatware-LIB I'm quite happy with my
large set of options for properly formatted conversions! :)
Even without using SSE, my way seems to be faster+shorter.

btw Bernhard, could you make PSHUFB work on your Phenom II?
I couldn't so far. Is this nice instruction an Intel-only SSE3 thing?
Or am I missing some configuration setup?

CR4 bit 9 (OSFXSR) = 1, CR4 bit 10 (OSXMMEXCPT) = 0, and I think I have
all SIMD/FPU exceptions masked off.

__
wolfgang

hopcode

Apr 24, 2011, 9:13:39 PM

Hi and Happy Easter, everyone,

On 24.04.2011 13:32, wolfgang kern wrote:
> large set of options on proper formatted conversions!:)
> Even still not using SSE my way seem to be faster+shorter.

Yes, I agree (see the source package at
http://sourceforge.net/projects/x32lab/, file
plugin\hex_code.asm of my hex viewer, where I use
a known method on 32-bit general registers that is ~35% faster
than the MMX variant); but SSE is quite common today,
and the PSHUFs are very attractive instructions.

After all, the way you convert is not so relevant;
take a look here, for example:
http://www.godevtool.com/TestbugHelp/Writinghex.htm

What matters, imho, is following a few basic rules:

a) separate the output formatting routine
from the bare conversion routine.

b) have a standard, abstract way of returning
information after the conversion (to allow faster/safer output)

This allows you to have *later* 0x00AABB or 0AABBh or
00000AABBh, or, using the most recently introduced facility, 0AA'BBh
and 0xAA'BB etc.
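
The two rules above can be sketched in C roughly as follows. This is an illustration only: to_hex_digits and decorate are hypothetical names, not routines from any of the libraries mentioned in this thread. The first function does the bare conversion, the second adds prefix, suffix and an optional digit separator afterwards.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Step 1: bare conversion to a fixed number of hex digits (1..16).
   out must hold ndigits + 1 bytes. */
void to_hex_digits(uint64_t v, int ndigits, char *out)
{
    static const char hex[] = "0123456789ABCDEF";
    for (int i = 0; i < ndigits; i++)
        out[i] = hex[(v >> (4 * (ndigits - 1 - i))) & 0x0F];
    out[ndigits] = '\0';
}

/* Step 2: decoration of an already converted digit string.
   sep == 0 or group == 0 disables the separator.
   Returns the length of the decorated string. */
size_t decorate(const char *digits, const char *prefix, const char *suffix,
                char sep, size_t group, char *out)
{
    size_t len = strlen(digits), pos = 0;

    for (const char *p = prefix; *p; p++)
        out[pos++] = *p;
    for (size_t i = 0; i < len; i++) {
        size_t remaining = len - i;
        /* separator between groups, counted from the right */
        if (sep && group && i > 0 && remaining % group == 0)
            out[pos++] = sep;
        out[pos++] = digits[i];
    }
    for (const char *s = suffix; *s; s++)
        out[pos++] = *s;
    out[pos] = '\0';
    return pos;
}
```

With the digits "AABB", the prefix "0", the suffix "h" and a quote separator every two digits, decorate yields 0AA'BBh; with the digits "00AABB" and the prefix "0x" it yields 0x00AABB. The conversion itself never needs to know which of these will be wanted.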

Surprisingly, I have become totally used to reading a lot of
assembly (sic!) code that simply neglects the minimum
abstraction requirements. What are they? That may be a
cool question. But every choice should always have one or
more reasons behind it. Note that I speak from my own
experience of 3 years of assembly programming, so
I am still like a baby :-)

Now, using SSE, for example, gives some more advantage when
facing big bulks of userland data conversions to/from the
Unicode context. Unicode being practically neither an option
nor a choice, the rest follows /mainly/ automatically.

But even there, in that Unicode context, on 64-bit
I confess I still make a lot of use of general registers,
avoiding the quite uncommon SSE 4.2 set.

A half-squandered RAX register is worse than the ready excuse
to continue 32-bit programming in compatibility mode
only because "64 bits aren't really needed nowadays".

Always the same old tune, isn't it? :-)

Bernhard Schornak

Apr 25, 2011, 1:00:38 AM

wolfgang kern wrote:

> Bernhard Schornak wrote:
>
>>> A new enhanced version of this snippet on
>>> x64lab Google group at
>
> https://groups.google.com/group/x64lab/browse_thread/thread/b72125ac10cc01e4?hl=en
>
>
>> Another way, throwing properly formatted
>> hexadecimal and decimal numbers (old and
>> outdated version - will be replaced with
>> current code sooner or later):
>
>> http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/cvt.S
>
>> All functions are compatible with Win-64
>> calling conventions. Used registers (ex-
>> cept RAX, XMM0...XMM3) are restored.
>
> After viewing this bloatware-LIB I'm quite happy with my
> large set of options on proper formatted conversions! :)
> Even still not using SSE my way seem to be faster+shorter.

To be honest: Conversions are the most utilised
functions, so this was the first module I wrote
on thin 64-bit ice - stitched together in a hurry ("mit der
heissen Nadel gestrickt"). As long as the remaining parts of the
libraries and applications aren't ported, there
is no time left to recode them. It is important
they work reliably in the first place - winning
beauty contests never was the main goal... ;)

Nevertheless, my solution with lookup tables is
faster than any branch free computation and de-
finitively -much- faster than computations with
conditional jumps. The only flaw are misaligned
memory writes (required for proper formatting).
The hex to decimal functions surely are nothing
else than 'brute force' solutions, but they are
faster than my first approach using DIV. Should
be replaced by branchless mul_div-chains sooner
or later...

The entire core library (including DBE, Windoze
wrappers, memory management, file access, time&
date, automated spinbuttons and much more) fits
into less than 64 KB. How slim could one crunch
it down without reducing its functionality?

(Any association between "slim" and "schlimm", German for "bad"?)

> btw Bernhard, could you make PSHUFB work on your Phenom II.
> I couldn't so far. Is this good thing an Intel-only SSE3 ?
> Or may I miss some configuration setups ?

The latest 26568 manual lists PSHUFB/VPSHUFB as
SSSE3 instructions, present if

CPUID Fn0000_0001_ECX[SSSE3]

respectively

CPUID Fn0000_0001_ECX[AVX]

are set. I think, both are not available on any
Phenom II, only family 15h manuals mention them
at the moment. Flying through the latencies for
the new processor generation, I found an inter-
esting entry - PUSH and POP are executed in one
clock cycle on family 15h processors. This is a
500 percent speed advantage compared to trivial
REG,MEM or MEM,REG moves, requiring five cycles
(same processor!). If this is not a typo, I was
interested how they managed to make a processor
execute 2 dependent operations (load/store REG,
then update RSP) simultaneously within a single
clock cycle, but a mundane memory move consumes
five clock cycles while it performs less work?

Let's hope, AMD does not end like Intel and be-
gins to count 0.25 clocks on a per pipe base...

Bernhard Schornak

Apr 25, 2011, 3:51:01 AM

hopcode wrote:

> Hi and Happy Easter, everyone,

And lots of colourful Easter eggs! ;)

> On 24.04.2011 13:32, wolfgang kern wrote:
>> large set of options on proper formatted conversions!:)
>> Even still not using SSE my way seem to be faster+shorter.
>
> Yes, i agree (see source package on
> http://sourceforge.net/projects/x32lab/ file
> plugin\hex_code.asm of my hexviewer where i use
> a known method on 32bit general registers, ~35% faster
> than the MMXs variant); but SSE is quite common today,
> and the PSHUFs are very attractive instructions.
>
> After all, the way you convert it is not so relevant;
> take a look here for example,
> http://www.godevtool.com/TestbugHelp/Writinghex.htm
>
> What matters, imho, follows some few basic rules:
>
> a) separate the output formatting routine
> from the bare conversion routine.
>
> b) having a standard abstract way when returning
> infos after conversion (to allow faster/safer outputting)

As long as you use printf, this surely is no issue.
I bet printf eats up all improvements on the fly.

My programs convert all required data to strings in
step one, then call WinSetDlgItemText() repeatedly.
This allows much faster calls, because only one re-
gister (the numbers to convert or the control's ID)
changes for all required calls. I do not like lists
looking like this

ABC
EA674
0
12345BA8

or

-2147483648
4294967295
0

so my conversions -force- formatted output

0000 0ABC
000E A674
0000 0000
1234 5BA8

or

- 2 147 483 648
4 294 967 295
0

Depends on what people accept as readable, I guess.

Except the misaligned writes, it doesn't take -any-
extra cycle to apply the intended formatting (where
the lost cycles for misaligned access are 'covered'
by the write combining mechanism - the entire write
sequence is performed in 3 clocks on AMD family 10h
processors...).
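
As a rough illustration of output that is formatted while it is converted, here is a small C sketch producing the two layouts shown above. The function names hex32_grouped and dec_grouped are invented for this example; the library code under discussion is assembly and works differently (lookup tables and direct writes), so this only shows the resulting formats, not the timing behaviour.

```c
#include <stdint.h>

/* "0000 0ABC" style: eight hex digits in two groups of four.
   out must hold 10 bytes. */
void hex32_grouped(uint32_t v, char out[10])
{
    static const char digits[] = "0123456789ABCDEF";
    int pos = 0;
    for (int i = 7; i >= 0; i--) {
        out[pos++] = digits[(v >> (4 * i)) & 0x0F];
        if (i == 4)
            out[pos++] = ' ';          /* gap between the two words */
    }
    out[pos] = '\0';
}

/* "- 2 147 483 648" style: sign, then groups of three decimal digits.
   out must hold 32 bytes. */
void dec_grouped(int64_t v, char out[32])
{
    char tmp[32];
    int n = 0, pos = 0;
    /* magnitude without overflowing on the most negative value */
    uint64_t mag = (v < 0) ? (uint64_t)0 - (uint64_t)v : (uint64_t)v;

    do {                               /* digits, least significant first */
        tmp[n++] = (char)('0' + mag % 10);
        mag /= 10;
    } while (mag != 0);

    if (v < 0) {
        out[pos++] = '-';
        out[pos++] = ' ';
    }
    while (n > 0) {
        out[pos++] = tmp[--n];
        if (n > 0 && n % 3 == 0)
            out[pos++] = ' ';          /* gap before each full group of three */
    }
    out[pos] = '\0';
}
```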

> This allow you having *later* 0x00AABB or 0AABBh or
> 00000AABBh, or using this last known facility, 0AA'BBh
> and 0xAA'BB etc.
>
> Suprisingly is that i am totally used to read a lot of
> assembly (sic!) code that simply neglects the minimum
> abstraction requirements. What are they ? that may be a
> cool question. But every choice should have always one or
> more grounds for. Note, that i speak from my contextual
> experience of 3 years of assembly programming, also
> i am like a baby :-)
>
> Now using SSE, for example, give some more advantage when
> facing big bulk of userland-data-conversions in/from the
> unicode context. Being Unicode practically not an option,
> nor a choice, the rest follows /mainly/ automatically.
>
> But even there, in that unicode context, on 64bit
> i confess i make lot of use again of general registers yet,
> avoiding the SSE 4.2 quite uncommon set.

It is not a 'choice', but counting clock cycles and
knowing how the processor executes instructions. My
code (even if it looks quite weird sometimes) feeds
all three pipes of an Athlon (if possible) and pre-
loads registers -at least- three clocks before they
are used. As long as accessed memory is in L1, this
is an improvement you'll get for free by organizing
instructions properly. Knowing the FPU works simul-
taneously with the integer units, it's surely worth
a thought to use SSE instructions to perform a part
of a task while the integer pipes are busy with the
remaining stuff.

> Half squandered RAX register is worse than the ready excuse
> to continue 32bit programming in compatibility mode
> only because "64bit arent really needed nowadays".
>
> Always the same old tune, isn't it? :-)

What matters is that the thing keeps turning - whether
that makes any sense doesn't matter at all. ;)

wolfgang kern

Apr 25, 2011, 3:24:09 AM

Bernhard Schornak replied:
...
>>> http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/cvt.S

>>> All functions are compatible with Win-64
>>> calling conventions. Used registers (ex-
>>> cept RAX, XMM0...XMM3) are restored.

>> After viewing this bloatware-LIB I'm quite happy with my
>> large set of options on proper formatted conversions! :)
>> Even still not using SSE my way seem to be faster+shorter.

> To be honest: Conversions are the most utilised
> functions, so this was the first module I wrote
> on thin 64-bit ice - "mit der heissen Nadel ge-
> strickt". As long as the remaining parts of the
> libraries and applications aren't ported, there
> is no time left to recode them. It is important
> they work reliable in the first place - winning
> beauty contests never was the main goal... ;)

ok :)

> Nevertheless, my solution with lookup tables is
> faster than any branch free computation and de-
> finitively -much- faster than computations with
> conditional jumps. The only flaw are misaligned
> memory writes (required for proper formatting).
> The hex to decimal functions surely are nothing
> else than 'brute force' solutions, but they are
> faster than my first approach using DIV. Should
> be replaced by branchless mul_div-chains sooner
> or later...

I'm still within 32-bit for now; it takes some time to make
everything from scratch for a 64-bit OS.
I too tried lookup tables first, but then figured out that
code-inherent immediate constants for MOV, CMP, AND/OR were a
bit faster. And I can use all seven 32-bit GP registers w/o
having issues with calling conventions or preservation needs.
So only a few 'locals' are required on the stack, and intermediate
temporary results are stored in the result buffer meanwhile.

> The entire core library (including DBE, Windoze
> wrappers, memory management, file access, time&
> date, automated spinbuttons and much more) fits
> into less than 64 KB. How slim could one crunch
> it down without reducing its functionality?

> (Any association between "slim" and "schlimm"?)

:) perhaps there once was a common meaning in both.

>> btw Bernhard, could you make PSHUFB work on your Phenom II.
>> I couldn't so far. Is this good thing an Intel-only SSE3 ?
>> Or may I miss some configuration setups ?

> The latest 26568 manual lists PSHUFB/VPSHUFB as
> SSSE3 instructions, present if
>
> CPUID Fn0000_00001_ECX[SSSE3]
>
> respective
>
> CPUID Fn0000_00001_ECX[AVX]
>
> are set. I think, both are not available on any
> Phenom II, only family 15h manuals mention them
> at the moment.

I just stumbled over CPUID(1,ECX) reading 00802009h,
but CPUID (April 2010) says that only the instructions
listed in APM4 are available.
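
For reference, the ECX value quoted here can be decoded with a few lines of C. The bit positions are the standard CPUID function 1 feature flags (SSE3 = bit 0, SSSE3 = bit 9, CMPXCHG16B = bit 13, POPCNT = bit 23, AVX = bit 28); the helper name and the macros are made up for this sketch, which only inspects a value that was already read and does not execute CPUID itself.

```c
#include <stdint.h>

/* Bit numbers in CPUID function 1, register ECX. */
#define CPUID1_ECX_SSE3        0
#define CPUID1_ECX_SSSE3       9
#define CPUID1_ECX_CMPXCHG16B 13
#define CPUID1_ECX_POPCNT     23
#define CPUID1_ECX_AVX        28

/* Hypothetical helper: returns 1 if the given feature bit is set. */
int cpuid1_ecx_has(uint32_t ecx, unsigned bit)
{
    return (int)((ecx >> bit) & 1u);
}
```

For 00802009h this reports SSE3, CMPXCHG16B and POPCNT as present and SSSE3 and AVX as absent, which is consistent with PSHUFB not working on that machine.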

> Flying through the latencies for
> the new processor generation, I found an inter-
> esting entry - PUSH and POP are executed in one
> clock cycle on family 15h processors. This is a
> 500 percent speed advantage compared to trivial
> REG,MEM or MEM,REG moves, requiring five cycles
> (same processor!). If this is not a typo, I was
> interested how they managed to make a processor
> execute 2 dependent operations (load/store REG,
> then update RSP) simultaneously within a single
> clock cycle, but a mundane memory move consumes
> five clock cycles while it performs less work?

On-chip cache can be as fast as the CPU itself, while
external memory access needs to fiddle with addressing,
cache checks and paging, and not least has to compete for the
slower RAM bus anyway.

> Let's hope, AMD does not end like Intel and be-
> gins to count 0.25 clocks on a per pipe base...

I'm sure that not even CPU-designers can tell any details
on instruction timing anymore :)

Now let's go crack some Easter eggs and pour another beer ...

__
wolfgang

wolfgang kern

Apr 25, 2011, 6:00:41 AM

"hopcode" wrote:

> Hi and Happy Easter, everyone,

same in return!

> On 24.04.2011 13:32, wolfgang kern wrote:
>> large set of options on proper formatted conversions!:)
>> Even still not using SSE my way seem to be faster+shorter.

> Yes, i agree (see source package on
> http://sourceforge.net/projects/x32lab/ file
> plugin\hex_code.asm

Couldn't get this one.

> of my hexviewer where i use
> a known method on 32bit general registers, ~35% faster
> than the MMXs variant); but SSE is quite common today,
> and the PSHUFs are very attractive instructions.

Sure, PSHUFB especially seems to be a powerful instruction.
Unfortunately it is not implemented in my latest toy (Phenom II).

> After all, the way you convert it is not so relevant;
> take a look here for example,
> http://www.godevtool.com/TestbugHelp/Writinghex.htm

Yeah, there are many possible ways. I prefer the short 'and'
fast solutions, even if they may become machine-specific then.

> What matters, imho, follows some few basic rules:

> a) separate the output formatting routine
> from the bare conversion routine.

> b) having a standard abstract way when returning
> infos after conversion (to allow faster/safer outputting)

> This allow you having *later* 0x00AABB or 0AABBh or
> 00000AABBh, or using this last known facility, 0AA'BBh
> and 0xAA'BB etc.

My standard shows all leading zeros so that the size is visible as well, i.e.:
FFFF8000 or 00008000. An appended 'h' is optional, but without
an added zero in front of it, because 'my' HEX is never signed!

OK for a fixed-size source like EAX.
But I support numeric variables from 4-bit signed nibbles up to
512-bit integers with 32-bit 10**n exponents and can display them
in almost all possible formats like Engineering/Scientific/Padded/
Fixpoint/Pseudopoint/Normalised/hex/bcd/bin/oct incl. truncate/
round, besides an automated or intentional fit into an (editable input)
field, not to forget the positioning options:
RSET/LSET/InString/TabSET/NumTabSet/DPTabSet/FieldAligned/Spaced...
and attributes for font, colours, borders, boxed, caption, buttons...

So for all the small figures which fit into 32bit and have not
too many attributes assigned, the conversion is done on the fly
in the output routine, this is where characters become dots on
screen or on the printer. :)

For the larger figures I use a buffer, perhaps just for debug
purpose, because I could bypass it and feed my output with
32-bit parts as well as I often use to display Z-ASCII-quads.

> Suprisingly is that i am totally used to read a lot of
> assembly (sic!) code that simply neglects the minimum
> abstraction requirements. What are they ? that may be a
> cool question. But every choice should have always one or
> more grounds for. Note, that i speak from my contextual
> experience of 3 years of assembly programming, also
> i am like a baby :-)

I may have less experience with ASM; I still type machine code!
So 'abstraction' is a thing to avoid by all means for me.

> Now using SSE, for example, give some more advantage when
> facing big bulk of userland-data-conversions in/from the
> unicode context. Being Unicode practically not an option,
> nor a choice, the rest follows /mainly/ automatically.

Yes, many SSE instructions have finally become almost standard.
Even if I personally don't find much use for floats, the GPUs
seem to like them ...

Unicode is almost useless in the western world, so I convert
the few special European characters into extended (IBM) ASCII.

> But even there, in that unicode context, on 64bit
> i confess i make lot of use again of general registers yet,
> avoiding the SSE 4.2 quite uncommon set.

> Half squandered RAX register is worse than the ready excuse
> to continue 32bit programming in compatibility mode
> only because "64bit arent really needed nowadays".

x86-64 is perhaps an attempt to prepare everyone involved for
next-generation CPUs. That could be a true 64-bit CPU without backward
compatibility issues. At the moment this 'second' CPU isn't separate
enough from the first old x86 core. It seems we have to wait for a
genuine hybrid chip which contains two fully separate CPUs.

> Always the same old tune, isn't it? :-)

Yes, sadly that's how it is.
__
wolfgang

Bernhard Schornak

Apr 26, 2011, 4:55:32 PM

wolfgang kern wrote:

> Bernhard Schornak replied:
> ...

<snip>

>> Nevertheless, my solution with lookup tables is
>> faster than any branch free computation and de-
>> finitively -much- faster than computations with
>> conditional jumps. The only flaw are misaligned
>> memory writes (required for proper formatting).
>> The hex to decimal functions surely are nothing
>> else than 'brute force' solutions, but they are
>> faster than my first approach using DIV. Should
>> be replaced by branchless mul_div-chains sooner
>> or later...
>
> I'm still within 32-bit for now, it takes some time to make
> everthing from scratch for a 64-bit OS.

Same applies to any other software, as well. As
long as you don't use 'standard libraries', you
have to figure out how things work on your own.

> Me too tried first with lookup tables, but then figured that
> code-inherent immediate constants for MOV,CMP,AND/OR were a
> bit faster. And I can use all seven 32-bit GP-registers w/o
> having issues with calling conventions or preservation needs.
> So only a few 'locals' are required on stack and intermediate
> temporary results were stored in the result buffer meanwhile.

I do not split byte into nibbles - each byte is
translated in one gulp (512 byte table). I just
separate n byte, load n corresponding words and
write them to the desired positions in the out-
put buffer. This is one clock per read, one per
write + two clocks at the end of this procedure
for the last write (all 3 pipes busy all of the
time). Except B2hex() and W2hex(), translations
are present when they are written.
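
The byte-at-a-time lookup described above might look like this in C. It is a sketch under assumptions: the 512-byte table is built at run time here, the names hex_pairs, init_hex_pairs and u32_to_hex8 are hypothetical, and no output formatting is applied.

```c
#include <stdint.h>

/* 256 entries of two ASCII digits each = 512 bytes. */
static char hex_pairs[256][2];

/* Fill the table once before the first conversion. */
void init_hex_pairs(void)
{
    static const char hex[] = "0123456789ABCDEF";
    for (int b = 0; b < 256; b++) {
        hex_pairs[b][0] = hex[b >> 4];
        hex_pairs[b][1] = hex[b & 0x0F];
    }
}

/* One table read and one two-character write per source byte;
   no nibble splitting at conversion time. out must hold 9 bytes. */
void u32_to_hex8(uint32_t v, char out[9])
{
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(v >> (24 - 8 * i));   /* most significant first */
        out[2 * i]     = hex_pairs[b][0];
        out[2 * i + 1] = hex_pairs[b][1];
    }
    out[8] = '\0';
}
```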

Unfortunately, you are bound to idiotic calling
conventions if you write something for existing
platforms. Compared to Win32, Win64 conventions
are more convenient for asm-programmers. What I
don't like is the increase of registers used as
garbage pile - you now have to save & restore 8
registers (versus two in good old Win32) if you
want to work in a clean environment. A properly
designed OS doesn't overwrite any register with
garbage (my opinion).

BTW: I use all GPs except rSP (fixed to the be-
ginning of a cache line on function entry). rAX
through rBP preferred for 32 bit, rNN preferred
for 64 bit (saves extra prefixes).

>> The entire core library (including DBE, Windoze
>> wrappers, memory management, file access, time&
>> date, automated spinbuttons and much more) fits
>> into less than 64 KB. How slim could one crunch
>> it down without reducing its functionality?
>
>> (Any association between "slim" and "schlimm"?)
> :) perhaps there once were a common sense in both.

An urban legend? ;)

>>> btw Bernhard, could you make PSHUFB work on your Phenom II.
>>> I couldn't so far. Is this good thing an Intel-only SSE3 ?
>>> Or may I miss some configuration setups ?
>
>> The latest 26568 manual lists PSHUFB/VPSHUFB as
>> SSSE3 instructions, present if
>>
>> CPUID Fn0000_00001_ECX[SSSE3]
>>
>> respective
>>
>> CPUID Fn0000_00001_ECX[AVX]
>>
>> are set. I think, both are not available on any
>> Phenom II, only family 15h manuals mention them
>> at the moment.
>
> I just stumbled over CPUID(1,ECX) read as 00802009h
> but CPUID (April 2010) tells that only instructions
> listed in APM4 are available.

Download AMD's latest manuals. They added a few
documents, e.g. the latest CPUID updates and an
optimization guide for Bulldozer/Zambezi. Chip-
set manuals still end with 7xx (there are MoBos
with 8xx chipset on the market, 9xx is ready to
be launched together with Zambezi).

Looking at latencies in the optimization guide,
the new processors are much slower than the old
Phenom II series, e.g. memory moves need 5 (vs.
3) clocks now. Let's wait, how Zambezi performs
under real life conditions.

>> Flying through the latencies for
>> the new processor generation, I found an inter-
>> esting entry - PUSH and POP are executed in one
>> clock cycle on family 15h processors. This is a
>> 500 percent speed advantage compared to trivial
>> REG,MEM or MEM,REG moves, requiring five cycles
>> (same processor!). If this is not a typo, I was
>> interested how they managed to make a processor
>> execute 2 dependent operations (load/store REG,
>> then update RSP) simultaneously within a single
>> clock cycle, but a mundane memory move consumes
>> five clock cycles while it performs less work?
>
> On chip cache can be as fast as the CPU itself, while
> external memory access needs to fiddle with addressing,
> cache-checks, paging and not at least race for the
> anyway slower RAM Bus.

??? A program's stack always resides in regular
memory. I guess, they just implemented some re-
gister renaming tricks to hide the required up-
date of rSP. Whenever the scheduler 'sees' PUSH
or POP, the old rSP is copied to a spare regis-
ter, and the original rSP is updated before the
PUSH or POP is executed. Still makes me wonder,
where those five cycles for the required memory
move (including all load/store mechanisms) dis-
appear to...

>> Let's hope, AMD does not end like Intel and be-
>> gins to count 0.25 clocks on a per pipe base...
>
> I'm sure that not even CPU-designers can tell any details
> on instruction timing anymore :)

At least _they_ should! ;)

> Now let's go crack some Easter eggs and pour another beer ...

I'd rather have an espresso in which you can taste every
single little bean.

Bernhard Schornak

Mar 5, 2012, 2:37:05 AM

hopcode wrote:

>> A new enhanced version of this snippet on
>> x64lab Google group at
>>
>> https://groups.google.com/group/x64lab/browse_thread/thread/b72125ac10cc01e4?hl=en

While recoding my conversion libraries, I translated your
code. Following Win-64 calling conventions, RCX holds the
number, RDX the address of the target buffer:

.section .rdata, "dr"
.p2align 4,,15
tASC:.quad 0x6766656463626160, 0x7675747372716968

_q2h:movq _BNR(%rip), %rax
subq $0x38, %rsp
movq %r8, 0x30(%rsp)
movdqa tASC(%rip), %xmm2
movdqa tASC(%rip), %xmm3
pxor %xmm0, %xmm0
xorq %r8, %r8
bsr %rcx, %r8
shrq $0x02, %r8
bswap %rcx
subq $0x0F, %r8
negq %r8
movq %rcx, %xmm1
punpcklbw %xmm1, %xmm0
pand CVT_0F(%rax), %xmm0
punpcklbw %xmm1, %xmm1
psrlw $0x04, %xmm1
pand CVT_0F(%rax), %xmm1
psrlw $0x08, %xmm1
pshufb %xmm0, %xmm2
pshufb %xmm1, %xmm3
por %xmm3, %xmm2
psubb CVT_30(%rax), %xmm2
movhlps %xmm2, %xmm1
movq %xmm2, 0x00(%rdx, %r8)
movq %xmm1, 0x08(%rdx, %r8)
movb $0x30, 0x00(%rdx)
movq 0x30(%rsp), %r8
bswap %rcx
addq $0x38, %rsp
ret

Average latency: 13 clock cycles (FX-8150). This code has
one flaw, though. It replaces trailing zeroes (0x30) with
NULLs (0x00). An additional shift of XMM2 was required to
make this work properly (increasing latency by one or two
clocks...).

Removing the formatting part (BSR is slow) and applying
some tricks, average latency can be reduced to 6 clocks:

_q2h:movq _BNR(%rip), %rax
bswap %rcx
subq $0x08, %rsp
pxor %xmm0, %xmm0
movq %rcx, %xmm1
movdqa tASC(%rip), %xmm2
punpcklbw %xmm1, %xmm0
pand CVT_0F(%rax), %xmm0
punpcklbw %xmm1, %xmm1
movdqa tASC(%rip), %xmm3
psrlw $0x04, %xmm1
pand CVT_0F(%rax), %xmm1
psrlw $0x08, %xmm1
pshufb %xmm0, %xmm2
pshufb %xmm1, %xmm3
por %xmm3, %xmm2
psubb CVT_30(%rax), %xmm2
bswap %rcx
movdqa %xmm2, 0x00(%rdx)
movq $0x00, 0x10(%rdx)
addq $0x08, %rsp
ret

As mentioned in previous postings, PSHUFB isn't available
on all machines. The new SSE2 solution for my libraries:

_q2h:subq $0xF8, %rsp
movq _BNR(%rip), %rax
bswap %rcx
movdqa %xmm4, 0xD0(%rsp)
movdqa %xmm5, 0xE0(%rsp)
movq %rcx, %xmm0
movdqa CVT_30(%rax), %xmm2
movdqa CVT_09(%rax), %xmm3
movdqa %xmm0, %xmm1
movdqa CVT_0F(%rax), %xmm4
psrlq $0x04, %xmm1
movdqa CVT_07(%rax), %xmm5
punpcklbw %xmm0, %xmm1
pand %xmm4, %xmm1
movdqa %xmm1, %xmm0
pcmpgtb %xmm3, %xmm1
paddb %xmm2, %xmm0
pand %xmm5, %xmm1
movdqa 0xD0(%rsp), %xmm4
paddb %xmm1, %xmm0
movdqa 0xE0(%rsp), %xmm5
movq %xmm0, 0x00(%rdx)
psrldq $0x08, %xmm0
movb $0x20, 0x08(%rdx)
movdqu %xmm0, 0x09(%rdx)
movq $0x00, 0x18(%rdx)
bswap %rcx
xorl %eax, %eax
addq $0xF8, %rsp
ret

Average latency: 11 clock cycles. Removing the formatting
part reduces latency to 8 clocks.
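
The arithmetic of the SSE2 version can be modelled one digit at a time in C. This is a scalar sketch, not a translation of the vector code, and q2h_sse2_model is a made-up name. Each digit gets 30h added, plus 07h when it is greater than 9; the comparison yields an all-ones or all-zero mask, as PCMPGTB does per byte.

```c
#include <stdint.h>

/* Hypothetical scalar model of the SSE2 routine: 16 hex digits with a
   blank between the two dwords. out must hold 18 bytes. */
void q2h_sse2_model(uint64_t value, char out[18])
{
    int pos = 0;
    for (int i = 15; i >= 0; i--) {
        /* PSRLQ by 4 plus PAND 0Fh: isolate one nibble */
        uint8_t digit = (uint8_t)((value >> (4 * i)) & 0x0F);
        /* PCMPGTB against 09h: 0xFF if digit > 9, else 0x00 */
        uint8_t gt9 = (uint8_t)-(digit > 9);
        /* PADDB 30h, then add (mask AND 07h) to reach 'A'..'F' */
        out[pos++] = (char)(digit + 0x30 + (gt9 & 0x07));
        if (i == 8)
            out[pos++] = ' ';          /* MOVB $0x20 at offset 8 */
    }
    out[pos] = '\0';
}
```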

'Missing' tables are stored in a global memory block BNR,
holding tables and global variables for the program. It's
loaded on program start and written back on exit.

hopcode

Mar 6, 2012, 3:47:21 AM

Hello Bernhard,
You have broken that holy _holy_ silence of CLAX... ;-)
Sorry for my late response;
I am personally still involved in RNG math theories,
but I now see a good end (or starting point) using
decimals from Pi, as I said.

OK, your routine replacing PSHUFB is fine,
although I would object that PSHUFB is quite common
on 64-bit machines (since Athlon, year 2006,
if I am not wrong). For the record, I have
converted it to Intel syntax:

align 16
.q2h_const:
dq 3030303030303030h
dq 3030303030303030h
dq 0909090909090909h
dq 0909090909090909h
dq 0F0F0F0F0F0F0F0Fh
dq 0F0F0F0F0F0F0F0Fh
dq 0707070707070707h
dq 0707070707070707h

.q2h:
sub rsp,0F8h
mov rax,rdx
bswap rcx
movdqa [rsp+0D0h],xmm4
movdqa [rsp+0E0h],xmm5
movq xmm0,rcx
movdqa xmm2,[.q2h_const] ;--- 30
movdqa xmm3,[.q2h_const+10h] ;--- 09
movdqa xmm1,xmm0
movdqa xmm4,[.q2h_const+20h] ;--- 0F
psrlq xmm1,4
movdqa xmm5,[.q2h_const+30h] ;--- 07
punpcklbw xmm1,xmm0

pand xmm1,xmm4
movdqa xmm0,xmm1
pcmpgtb xmm1,xmm3
paddb xmm0,xmm2
pand xmm1,xmm5

movdqa xmm4,[rsp+0D0h]
paddb xmm0,xmm1
movdqa xmm5,[rsp+0E0h]
movq [rdx],xmm0
psrldq xmm0,8
mov byte[rdx+8],20h
xor eax,eax ;--- <---
movdqu [rdx+9],xmm0
mov [rdx+24],rax
bswap rcx
add rsp,0F8h
ret 0

But what matters is the formatting, imo,
because yours, which outputs

00000000_00000123

using a space between the dwords, is already
biased towards a particular use (for example the address
column in a hex editor), whereas mine, on the contrary,
gives out the bare number

0000000000000123

In this way one can convert it to UTF-16
or add/skip/space chars.
I wrote a new version of it, not yet published,
but I will put it online as soon as I release
the first (but working) alpha of _x64lab_ Multilanguage
IDE for Windows, in 2 or 3 weeks.
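
Widening a bare digit string to UTF-16 is straightforward because every hex digit is plain ASCII. A minimal C sketch, with ascii_to_utf16 as a hypothetical helper name (an SSE version could interleave the bytes with zeros via PUNPCKLBW instead):

```c
#include <stddef.h>
#include <stdint.h>

/* Copies an ASCII string into UTF-16 code units, terminator included.
   dst must hold strlen(src) + 1 elements. Returns the number of
   characters written, not counting the terminator. */
size_t ascii_to_utf16(const char *src, uint16_t *dst)
{
    size_t n = 0;
    while (src[n] != '\0') {
        /* for ASCII, the code point equals the UTF-16 code unit */
        dst[n] = (uint16_t)(unsigned char)src[n];
        n++;
    }
    dst[n] = 0;
    return n;
}
```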

Cheers,

btw:
Just out of curiosity, what do you use to align and justify
the text in your posts so nicely? Does it work the same way for the
German language too? Could you share the algo?

--
.:mrk[hopcode]
.:x64lab:.
group http://groups.google.com/group/x64lab
site http://sites.google.com/site/x64lab

Bernhard Schornak

Mar 6, 2012, 1:06:34 PM

hopcode wrote:

> Hallo Bernhard,

Hi!

> You have broken that holy _holy_ silence of CLAX... ;-)

Sorry if I disturbed the snoozing community... ;)

> sorry for my late response,

No problem. Almost one year passed by since you posted
the last message in this thread, so it was up to me to
apologise.

> i am personally still involved in RNG math-theories,
> but i see now a good end (or starting point) using
> decimals from Pi, as i said.

Enjoy the work!

I probably will adopt one of Steven's algorithms. They
are pretty fast and sufficient for 'daily use'.

OMG... ;)

(I posted my translation because it did not work as it
should - might be an error in my translation or a bug
in your code.)

> but the matter is in formatting, imo.
> because Yours, that outputs
>
> 00000000_00000123
>
> using a space between the dword is already
> biased to a particular use (for example the address
> column in an hex editor); where mine on the contrary
> gives the bare number out
>
> 0000000000000123

Well, I consider my formatting as better readable than
yours. In large lists with hundreds of hex numbers, my
format is faster to read and comprehend. That trailing
'h' is confusing, 'cause the common conversion routine
in a human brain recognises it as a part of the number
rather than treating it as special marker at the first
glance. 0xNNNN is less ambiguous.

0123 4567 89AB CDEF 4656 5544 5665 5CFA
FEDC BA98 7654 3210 AEAD CBEF 7B9C DE45
5665 5CFA 7B9C DE45 0123 4567 FEDC BA98

001234567h 089ABCDEFh 046565544h 056655CFAh
0FEDCBA98h 076543210h 0AEADCBEFh 07B9CDE45h
056655CFAh 07B9CDE45h 001234567h 0FEDCBA98h

> in this way one can convert it to utf16
> or add/skip/space chars.

I prefer hardcoded conversion functions. They are much
faster and provide standardised output formats without
post processing. In general, these low level functions
should never need post processors rendering the gained
speed improvement to a slow motion re-play of the real
conversion. Do not split what can be done in one gulp.

> i wrote a new version of it, yet unpublished,
> but i will set it online as far as i rlease
> the first (but working) alfa of _x64lab_ Multilanguage
> IDE for Windows, in 2 or 3 weeks.

Speaking of multilingual applications:

http://code.google.com/p/st-open/downloads/detail?name=STbench.7z&can=2&q=

It's a little benchmark with 'built-in' multilingual
support. Switching between languages simply switches
between subfields of datafields. Language support is
an integral part of my libraries since V 5.0.0. (the
current version is V 8.1.0.). The latest version can
manage up to 256 languages (16 per datafield, reload
from associated language folder 0...F if required).

(Version 9.0.0. will be UTF16. Conversion and string
handling is simplified and those headaches caused by
unaligned addresses are cured forever.)

> Cheers,

Take care!

> btw:
> just for curiosity, what do You use to align and justify
> so fine the text in Your posts. does it work for German
> language too the same way ? Could You share the algo ?

Brain V 1.0. from 1956. Required component, can't be
shared... ;)