; mask_0F dq 0F0F0F0F'0F0F0F0Fh,0F0F0F0F'0F0F0F0Fh
; mask_str dq 6766656463626160h,7675'7473'72716968h
; mask_30 dq 3030303030303030h,3030303030303030h
;--- usage ----------------
mov rdi,dest
mov rax,0123456789ABCDEFh
call .dq2a
ret
;--- proc
.dq2a:
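movdqa xmm2,dqword [mask_str] ; digit table, each entry biased by +30h
movdqa xmm3,xmm2
movdqa xmm4,dqword[mask_0F]
movdqa xmm5,dqword[mask_30]
pxor xmm0,xmm0 ; even byte lanes must be 0 so they look up 60h
bswap rax ; most significant byte first in memory
mov qword[rdi],rax
punpcklbw xmm0,dqword[rdi] ; source bytes land in the odd lanes
pand xmm0,xmm4 ; odd lanes = low nibbles
punpcklbw xmm1,dqword[rdi]
psrlw xmm1,4
pand xmm1,xmm4
psrlw xmm1,8 ; even lanes = high nibbles, odd lanes = 0
pshufb xmm2,xmm0 ; SSSE3 table lookup
pshufb xmm3,xmm1
por xmm2,xmm3 ; 60h OR 6xh/7xh leaves the digit unchanged
psubb xmm2,xmm5 ; remove the 30h bias -> ASCII
movdqa dqword[rdi],xmm2 ; 16 characters, no terminator
ret 0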
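For checking the output of the routine above, a scalar reference in C can help. This is only a sketch, not part of the posted code; the names dq2a_ref and dq2a_ref_demo are made up here. It does one table lookup per nibble, most significant nibble first, and like the SSE version writes 16 characters without a terminator.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model: one table lookup per nibble, most significant
   nibble first. Writes exactly 16 characters, no terminator. */
void dq2a_ref(uint64_t value, char out[16])
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 0; i < 16; i++) {
        unsigned shift = 60u - 4u * (unsigned)i;   /* 60, 56, ..., 0 */
        out[i] = digits[(value >> shift) & 0x0F];
    }
}

/* Same input as in the usage lines of the post */
void dq2a_ref_demo(void)
{
    char buf[16];
    dq2a_ref(0x0123456789ABCDEFull, buf);
    assert(memcmp(buf, "0123456789ABCDEF", 16) == 0);
}
```

Comparing the 16 bytes at dest with the output of dq2a_ref for a few inputs should show quickly whether the lane handling in the SSE code is right.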
Tell me what you think about it.
Cheers,
--
.:hopcode[marc:rainer:kranz]:.
x64 Assembly Lab
http://sites.google.com/site/x64lab
https://groups.google.com/group/x64lab/browse_thread/thread/b72125ac10cc01e4?hl=en
On 23.04.2011 05:04, hopcode wrote:
> Hi All,
> I want to share this snippet, an SSE conversion of a DQWORD to a string.
> With a 16-byte-aligned destination buffer of at least
> 32 bytes in RDI and the value to convert in AL/AX/EAX/RAX,
> it returns in RDI a string zero-padded on the left:
>
> ; mask_0F dq 0F0F0F0F'0F0F0F0Fh,0F0F0F0F'0F0F0F0Fh
> ; mask_str dq 6766656463626160h,7675'7473'72716968h
> ; mask_30 dq 3030303030303030h,3030303030303030h
>
> ;--- usage ----------------
>
> mov rdi,dest
> mov rax,0123456789ABCDEFh
> call .dq2a
> ret
>
> ;--- proc
> .dq2a:
> A new enhanced version of this snippet on
> x64lab Google group at
>
> https://groups.google.com/group/x64lab/browse_thread/thread/b72125ac10cc01e4?hl=en
Another way, producing properly formatted
hexadecimal and decimal numbers (old and
outdated version - will be replaced with
current code sooner or later):
http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/cvt.S
All functions are compatible with the Win64
calling convention. Used registers (except
RAX, XMM0...XMM3) are restored.
Greetings from Augsburg
Bernhard Schornak
> http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/cvt.S
After viewing this bloatware LIB I'm quite happy with my
large set of options for properly formatted conversions! :)
Even without using SSE yet, my way seems to be faster and shorter.
Btw Bernhard, could you make PSHUFB work on your Phenom II?
I couldn't so far. Is this nice thing an Intel-only SSE3 instruction?
Or am I missing some configuration setup?
CR4 bit 9 = 1, CR4 bit 10 = 0, and I think I have all
SIMD/FPU exceptions masked off.
__
wolfgang
On 24.04.2011 13:32, wolfgang kern wrote:
> large set of options on proper formatted conversions!:)
> Even still not using SSE my way seem to be faster+shorter.
Yes, I agree (see the source package at
http://sourceforge.net/projects/x32lab/, file
plugin\hex_code.asm of my hex viewer, where I use
a known method on 32-bit general registers, ~35% faster
than the MMX variant); but SSE is quite common today,
and the PSHUFs are very attractive instructions.
After all, how you convert is not so relevant;
take a look here, for example:
http://www.godevtool.com/TestbugHelp/Writinghex.htm
What matters, imho, is following a few basic rules:
a) separate the output formatting routine
from the bare conversion routine.
b) have a standard, abstract way of returning
information after conversion (to allow faster/safer output).
This lets you produce 0x00AABB or 0AABBh or
00000AABBh *later*, or, using the digit-separator facility,
0AA'BBh and 0xAA'BB etc.
Surprisingly, I have got used to reading a lot of
assembly (sic!) code that simply neglects the minimum
abstraction requirements. What are they? That may be a
cool question. But every choice should always have one or
more reasons behind it. Note that I speak from my own
experience of 3 years of assembly programming, so
I am like a baby :-)
Using SSE, for example, gives some more advantage when
facing bulk conversions of userland data to/from
Unicode. Since Unicode is in practice neither optional
nor a matter of choice, the rest follows /mainly/ automatically.
But even there, in that Unicode context, on 64-bit
I confess I still make a lot of use of general registers,
avoiding the still quite uncommon SSE 4.2 set.
A half-squandered RAX register is worse than the ready excuse
for continuing 32-bit programming in compatibility mode
just because "64 bits aren't really needed nowadays".
Always the same old tune, isn't it? :-)
> Bernhard Schornak wrote:
>
>>> A new enhanced version of this snippet on
>>> x64lab Google group at
>
> https://groups.google.com/group/x64lab/browse_thread/thread/b72125ac10cc01e4?hl=en
>
>
>> Another way, throwing properly formatted
>> hexadecimal and decimal numbers (old and
>> outdated version - will be replaced with
>> current code sooner or later):
>
>> http://code.google.com/p/st-open/source/browse/LIB/SOURCES/core/cvt.S
>
>> All functions are compatible with Win-64
>> calling conventions. Used registers (ex-
>> cept RAX, XMM0...XMM3) are restored.
>
> After viewing this bloatware-LIB I'm quite happy with my
> large set of options on proper formatted conversions! :)
> Even still not using SSE my way seem to be faster+shorter.
To be honest: Conversions are the most utilised
functions, so this was the first module I wrote
on thin 64-bit ice - "knitted with a hot needle",
i.e. done in a hurry. As long as the remaining
parts of the libraries and applications aren't
ported, there is no time left to recode them. It
is important that they work reliably in the first
place - winning beauty contests never was the
main goal... ;)
Nevertheless, my solution with lookup tables is
faster than any branch-free computation and
definitely -much- faster than computations with
conditional jumps. The only flaw is the misaligned
memory writes (required for proper formatting).
The hex to decimal functions surely are nothing
else than 'brute force' solutions, but they are
faster than my first approach using DIV. Should
be replaced by branchless mul_div-chains sooner
or later...
The entire core library (including DBE, Windoze
wrappers, memory management, file access, time &
date, automated spinbuttons and much more) fits
into less than 64 KB. How slim could one crunch
it down without reducing its functionality?
(Any association between "slim" and "schlimm",
German for "bad"?)
> btw Bernhard, could you make PSHUFB work on your Phenom II.
> I couldn't so far. Is this good thing an Intel-only SSE3 ?
> Or may I miss some configuration setups ?
The latest 26568 manual lists PSHUFB/VPSHUFB as
SSSE3 instructions, present if
CPUID Fn0000_0001_ECX[SSSE3]
or, respectively,
CPUID Fn0000_0001_ECX[AVX]
is set. I think neither is available on any
Phenom II; only the family 15h manuals mention them
at the moment. Skimming through the latencies for
the new processor generation, I found an interesting
entry - PUSH and POP are executed in one
clock cycle on family 15h processors. This is a
fivefold speed advantage compared to trivial
REG,MEM or MEM,REG moves, which require five cycles
(same processor!). If this is not a typo, I would be
interested in how they managed to make a processor
execute 2 dependent operations (load/store REG,
then update RSP) simultaneously within a single
clock cycle, while a mundane memory move consumes
five clock cycles although it performs less work.
Let's hope AMD does not end up like Intel and
begin to count 0.25 clocks on a per-pipe basis...
> Hi and Happy Easter, everyone,
And lots of colourful Easter eggs! ;)
> On 24.04.2011 13:32, wolfgang kern wrote:
>> large set of options on proper formatted conversions!:)
>> Even still not using SSE my way seem to be faster+shorter.
>
> Yes, i agree (see source package on
> http://sourceforge.net/projects/x32lab/ file
> plugin\hex_code.asm of my hexviewer where i use
> a known method on 32bit general registers, ~35% faster
> than the MMXs variant); but SSE is quite common today,
> and the PSHUFs are very attractive instructions.
>
> After all, the way you convert it is not so relevant;
> take a look here for example,
> http://www.godevtool.com/TestbugHelp/Writinghex.htm
>
> What matters, imho, follows some few basic rules:
>
> a) separate the output formatting routine
> from the bare conversion routine.
>
> b) having a standard abstract way when returning
> infos after conversion (to allow faster/safer outputting)
As long as you use printf, this surely is no issue.
I bet printf eats up all improvements on the fly.
My programs convert all required data to strings in
step one, then call WinSetDlgItemText() repeatedly.
This allows much faster calls, because only one
register (the number to convert or the control's ID)
changes across all required calls. I do not like lists
looking like this
ABC
EA674
0
12345BA8
or
-2147483648
4294967295
0
so my conversions -force- formatted output
0000 0ABC
000E A674
0000 0000
1234 5BA8
or
- 2 147 483 648
4 294 967 295
0
Depends on what people accept as readable, I guess.
Except the misaligned writes, it doesn't take -any-
extra cycle to apply the intended formatting (where
the lost cycles for misaligned access are 'covered'
by the write combining mechanism - the entire write
sequence is performed in 3 clocks on AMD family 10h
processors...).
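The forced hex grouping shown above is easy to model. A minimal sketch in C, assuming a 32-bit input and a caller-supplied buffer of 10 bytes; the name hex32_grouped is hypothetical and this is not the library's code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Formats a 32-bit value as "HHHH HHHH": all leading zeros kept,
   one blank between the upper and lower 16 bits.
   out must hold 10 bytes (9 characters plus terminator). */
void hex32_grouped(uint32_t value, char out[10])
{
    static const char digits[] = "0123456789ABCDEF";
    int pos = 0;
    for (int shift = 28; shift >= 0; shift -= 4) {
        out[pos++] = digits[(value >> shift) & 0x0F];
        if (shift == 16)
            out[pos++] = ' ';          /* separator after 4 digits */
    }
    out[pos] = '\0';
}

/* Values taken from the example list in the post */
void hex32_grouped_demo(void)
{
    char buf[10];
    hex32_grouped(0x00000ABCu, buf);
    assert(strcmp(buf, "0000 0ABC") == 0);
    hex32_grouped(0x12345BA8u, buf);
    assert(strcmp(buf, "1234 5BA8") == 0);
}
```

The separator costs one extra store per value here; the misaligned-write question raised in the post only appears when the digits are written in wider units.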
> This allow you having *later* 0x00AABB or 0AABBh or
> 00000AABBh, or using this last known facility, 0AA'BBh
> and 0xAA'BB etc.
>
> Suprisingly is that i am totally used to read a lot of
> assembly (sic!) code that simply neglects the minimum
> abstraction requirements. What are they ? that may be a
> cool question. But every choice should have always one or
> more grounds for. Note, that i speak from my contextual
> experience of 3 years of assembly programming, also
> i am like a baby :-)
>
> Now using SSE, for example, give some more advantage when
> facing big bulk of userland-data-conversions in/from the
> unicode context. Being Unicode practically not an option,
> nor a choice, the rest follows /mainly/ automatically.
>
> But even there, in that unicode context, on 64bit
> i confess i make lot of use again of general registers yet,
> avoiding the SSE 4.2 quite uncommon set.
It is not a 'choice', but counting clock cycles and
knowing how the processor executes instructions. My
code (even if it looks quite weird sometimes) feeds
all three pipes of an Athlon (if possible) and pre-
loads registers -at least- three clocks before they
are used. As long as accessed memory is in L1, this
is an improvement you'll get for free by organizing
instructions properly. Knowing the FPU works simul-
taneously with the integer units, it's surely worth
a thought to use SSE instructions to perform a part
of a task while the integer pipes are busy with the
remaining stuff.
> Half squandered RAX register is worse than the ready excuse
> to continue 32bit programming in compatibility mode
> only because "64bit arent really needed nowadays".
>
> Always the same old tune, isn't it? :-)
What matters is that the thing keeps on turning - whether
that makes any sense doesn't matter at all. ;)
>>> All functions are compatible with Win-64
>>> calling conventions. Used registers (ex-
>>> cept RAX, XMM0...XMM3) are restored.
>> After viewing this bloatware-LIB I'm quite happy with my
>> large set of options on proper formatted conversions! :)
>> Even still not using SSE my way seem to be faster+shorter.
> To be honest: Conversions are the most utilised
> functions, so this was the first module I wrote
> on thin 64-bit ice - "knitted with a hot needle",
> i.e. in a hurry. As long as the remaining parts of the
> libraries and applications aren't ported, there
> is no time left to recode them. It is important
> they work reliable in the first place - winning
> beauty contests never was the main goal... ;)
ok :)
> Nevertheless, my solution with lookup tables is
> faster than any branch free computation and de-
> finitively -much- faster than computations with
> conditional jumps. The only flaw are misaligned
> memory writes (required for proper formatting).
> The hex to decimal functions surely are nothing
> else than 'brute force' solutions, but they are
> faster than my first approach using DIV. Should
> be replaced by branchless mul_div-chains sooner
> or later...
I'm still within 32-bit for now; it takes some time to build
everything from scratch for a 64-bit OS.
I too tried lookup tables first, but then found that
immediate constants embedded in the code for MOV, CMP, AND/OR were a
bit faster. And I can use all seven 32-bit GP registers without
running into issues with calling conventions or preservation needs.
So only a few 'locals' are required on the stack, and intermediate
temporary results are stored in the result buffer in the meantime.
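One way to get by with immediate constants only, sketched in C rather than machine code. This illustrates the general table-free technique and is an assumption, not wolfgang's routine; nibble_to_hex and hex32_imm are made-up names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Nibble to ASCII without a table: add '0', plus 7 more for 10..15
   to jump from '9' to 'A'. (n + 6) >> 4 is 1 exactly when n >= 10. */
char nibble_to_hex(unsigned n)
{
    n &= 0x0Fu;
    return (char)('0' + n + 7u * ((n + 6u) >> 4));
}

/* 32-bit value to 8 hex digits with leading zeros; out needs 9 bytes */
void hex32_imm(uint32_t value, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = nibble_to_hex(value >> (28 - 4 * i));
    out[8] = '\0';
}

void hex32_imm_demo(void)
{
    char buf[9];
    hex32_imm(0xFFFF8000u, buf);
    assert(strcmp(buf, "FFFF8000") == 0);
}
```

Whether this beats a table depends on the CPU and on whether the table is already in L1, which is the point being argued in this thread.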
> The entire core library (including DBE, Windoze
> wrappers, memory management, file access, time&
> date, automated spinbuttons and much more) fits
> into less than 64 KB. How slim could one crunch
> it down without reducing its functionality?
> (Any association between "slim" and "schlimm"?)
:) Perhaps both once shared a common meaning.
>> btw Bernhard, could you make PSHUFB work on your Phenom II.
>> I couldn't so far. Is this good thing an Intel-only SSE3 ?
>> Or may I miss some configuration setups ?
> The latest 26568 manual lists PSHUFB/VPSHUFB as
> SSSE3 instructions, present if
>
> CPUID Fn0000_00001_ECX[SSSE3]
>
> respective
>
> CPUID Fn0000_00001_ECX[AVX]
>
> are set. I think, both are not available on any
> Phenom II, only family 15h manuals mention them
> at the moment.
I just noticed that CPUID(1,ECX) reads as 00802009h,
but the CPUID document (April 2010) says that only the
instructions listed in APM4 are available.
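The value quoted above can be decoded against the feature bits of CPUID function 0000_0001h, register ECX. A small C sketch; the bit positions (SSE3 = bit 0, SSSE3 = bit 9, AVX = bit 28) are the documented ones, the macro and function names are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Feature bits in ECX after CPUID function 0000_0001h */
#define CPUID1_ECX_SSE3   (UINT32_C(1) << 0)
#define CPUID1_ECX_SSSE3  (UINT32_C(1) << 9)
#define CPUID1_ECX_AVX    (UINT32_C(1) << 28)

/* Returns 1 if every bit in mask is set in the given ECX value */
int cpuid1_ecx_has(uint32_t ecx, uint32_t mask)
{
    return (ecx & mask) == mask;
}

void cpuid1_ecx_demo(void)
{
    uint32_t ecx = UINT32_C(0x00802009);   /* value reported in the thread */
    assert(cpuid1_ecx_has(ecx, CPUID1_ECX_SSE3));
    assert(!cpuid1_ecx_has(ecx, CPUID1_ECX_SSSE3));
    assert(!cpuid1_ecx_has(ecx, CPUID1_ECX_AVX));
}
```

For 00802009h this reports SSE3 but not SSSE3, so PSHUFB would be expected to raise an invalid-opcode exception on that CPU regardless of the CR4 settings. Real AVX detection additionally requires operating-system support, which this sketch does not check.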
> Flying through the latencies for
> the new processor generation, I found an inter-
> esting entry - PUSH and POP are executed in one
> clock cycle on family 15h processors. This is a
> 500 percent speed advantage compared to trivial
> REG,MEM or MEM,REG moves, requiring five cycles
> (same processor!). If this is not a typo, I was
> interested how they managed to make a processor
> execute 2 dependent operations (load/store REG,
> then update RSP) simultaneously within a single
> clock cycle, but a mundane memory move consumes
> five clock cycles while it performs less work?
On-chip cache can be as fast as the CPU itself, while
external memory access has to deal with addressing,
cache checks, paging and, not least, contention for the
slower RAM bus anyway.
> Let's hope, AMD does not end like Intel and be-
> gins to count 0.25 clocks on a per pipe base...
I'm sure that not even CPU-designers can tell any details
on instruction timing anymore :)
Now let's go egg tapping and follow it up with a beer ...
__
wolfgang
> Hi and Happy Easter, everyone,
same in return!
> On 24.04.2011 13:32, wolfgang kern wrote:
>> large set of options on proper formatted conversions!:)
>> Even still not using SSE my way seem to be faster+shorter.
> Yes, i agree (see source package on
> http://sourceforge.net/projects/x32lab/ file
> plugin\hex_code.asm
Couldn't get this one.
> of my hexviewer where i use
> a known method on 32bit general registers, ~35% faster
> than the MMXs variant); but SSE is quite common today,
> and the PSHUFs are very attractive instructions.
Sure, especially PSHUFB seems to be a powerful instruction.
Unfortunately it is not implemented in my latest toy (Phenom II).
> After all, the way you convert it is not so relevant;
> take a look here for example,
> http://www.godevtool.com/TestbugHelp/Writinghex.htm
Yeah, there are many possible ways. I prefer the short 'and'
fast solutions, even if they may become machine-specific.
> What matters, imho, follows some few basic rules:
> a) separate the output formatting routine
> from the bare conversion routine.
> b) having a standard abstract way when returning
> infos after conversion (to allow faster/safer outputting)
> This allow you having *later* 0x00AABB or 0AABBh or
> 00000AABBh, or using this last known facility, 0AA'BBh
> and 0xAA'BB etc.
My standard shows all leading zeros so that the size is visible as well, i.e.
FFFF8000 or 00008000. An appended 'h' is optional, but without
an added zero in front, because 'my' HEX is never signed!
That is fine for a fixed-size source like EAX.
But I support numeric variables from 4-bit signed nibbles up to
512-bit integers with 32-bit 10**n exponents and can display them
in almost all possible formats like Engineering/Scientific/Padded/
Fixpoint/Pseudopoint/Normalised/hex/bcd/bin/oct incl. truncate/
round, besides an automated or intentional fit-into-field option
(for editable input fields), not to forget the positioning options:
RSET/LSET/InString/TabSET/NumTabSet/DPTabSet/FieldAligned/Spaced...
and attributes for font, colours, borders, boxed, caption, buttons...
So for all the small figures which fit into 32 bits and do not have
too many attributes assigned, the conversion is done on the fly
in the output routine; this is where characters become dots on
the screen or on the printer. :)
For the larger figures I use a buffer, perhaps just for debugging
purposes, because I could bypass it and feed my output with
32-bit parts, just as I often do to display Z-ASCII quads.
> Suprisingly is that i am totally used to read a lot of
> assembly (sic!) code that simply neglects the minimum
> abstraction requirements. What are they ? that may be a
> cool question. But every choice should have always one or
> more grounds for. Note, that i speak from my contextual
> experience of 3 years of assembly programming, also
> i am like a baby :-)
I may have even less experience with ASM: I still type machine code!
So 'abstraction' is something I avoid by all means.
> Now using SSE, for example, give some more advantage when
> facing big bulk of userland-data-conversions in/from the
> unicode context. Being Unicode practically not an option,
> nor a choice, the rest follows /mainly/ automatically.
Yes, many SSE instructions have finally become almost standard.
Although I personally don't find much use for floats, the GPUs
seem to like them ...
Unicode is almost useless in the western world, so I convert
the few special European characters into extended (IBM) ASCII.
> But even there, in that unicode context, on 64bit
> i confess i make lot of use again of general registers yet,
> avoiding the SSE 4.2 quite uncommon set.
> Half squandered RAX register is worse than the ready excuse
> to continue 32bit programming in compatibility mode
> only because "64bit arent really needed nowadays".
x86-64 is perhaps an attempt to prepare everyone involved for
next-generation CPUs. These could be true 64-bit designs without
backward-compatibility issues. At the moment this 'second' CPU isn't
separated enough from the old x86 core. It seems we have to wait for a
genuine hybrid chip which contains two fully separate CPUs.
> Always the same old tune, isn't it? :-)
Yes, sadly that's how it is.
__
wolfgang
> Bernhard Schornak replied:
> ...
<snip>
>> Nevertheless, my solution with lookup tables is
>> faster than any branch free computation and de-
>> finitively -much- faster than computations with
>> conditional jumps. The only flaw are misaligned
>> memory writes (required for proper formatting).
>> The hex to decimal functions surely are nothing
>> else than 'brute force' solutions, but they are
>> faster than my first approach using DIV. Should
>> be replaced by branchless mul_div-chains sooner
>> or later...
>
> I'm still within 32-bit for now, it takes some time to make
> everthing from scratch for a 64-bit OS.
Same applies to any other software, as well. As
long as you don't use 'standard libraries', you
have to figure out how things work on your own.
> Me too tried first with lookup tables, but then figured that
> code-inherent immediate constants for MOV,CMP,AND/OR were a
> bit faster. And I can use all seven 32-bit GP-registers w/o
> having issues with calling conventions or preservation needs.
> So only a few 'locals' are required on stack and intermediate
> temporary results were stored in the result buffer meanwhile.
I do not split bytes into nibbles - each byte is
translated in one gulp (512-byte table). I just
separate n bytes, load the n corresponding words and
write them to the desired positions in the output
buffer. This is one clock per read, one per
write + two clocks at the end of this procedure
for the last write (all 3 pipes busy all of the
time). Except B2hex() and W2hex(), translations
are present when they are written.
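The byte-at-a-time translation described here can be sketched in C as follows. The table layout (256 entries of two characters, 512 bytes in total) matches the description; the names and the separate init function are assumptions for illustration, not taken from the library:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 256 entries x 2 ASCII digits = 512 bytes, one entry per byte value */
static char hex_pairs[256][2];

void hex_pairs_init(void)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int b = 0; b < 256; b++) {
        hex_pairs[b][0] = digits[b >> 4];
        hex_pairs[b][1] = digits[b & 0x0F];
    }
}

/* One lookup and one 2-byte store per input byte; out needs 9 bytes */
void hex32_by_table(uint32_t value, char out[9])
{
    for (int i = 0; i < 4; i++) {
        unsigned byte = (value >> (24 - 8 * i)) & 0xFFu;
        memcpy(out + 2 * i, hex_pairs[byte], 2);
    }
    out[8] = '\0';
}

void hex32_by_table_demo(void)
{
    char buf[9];
    hex_pairs_init();
    hex32_by_table(0x12345BA8u, buf);
    assert(strcmp(buf, "12345BA8") == 0);
}
```

Compared with a 16-entry nibble table this halves the number of lookups at the cost of a table that occupies eight 64-byte cache lines.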
Unfortunately, you are bound to idiotic calling
conventions if you write something for existing
platforms. Compared to Win32, Win64 conventions
are more convenient for asm-programmers. What I
don't like is the increase of registers used as
garbage pile - you now have to save & restore 8
registers (versus two in good old Win32) if you
want to work in a clean environment. A properly
designed OS doesn't overwrite any register with
garbage (my opinion).
BTW: I use all GPs except rSP (fixed to the be-
ginning of a cache line on function entry). rAX
through rBP preferred for 32 bit, rNN preferred
for 64 bit (saves extra prefixes).
>> The entire core library (including DBE, Windoze
>> wrappers, memory management, file access, time&
>> date, automated spinbuttons and much more) fits
>> into less than 64 KB. How slim could one crunch
>> it down without reducing its functionality?
>
>> (Any association between "slim" and "schlimm"?)
> :) perhaps there once were a common sense in both.
An urban legend? ;)
>>> btw Bernhard, could you make PSHUFB work on your Phenom II.
>>> I couldn't so far. Is this good thing an Intel-only SSE3 ?
>>> Or may I miss some configuration setups ?
>
>> The latest 26568 manual lists PSHUFB/VPSHUFB as
>> SSSE3 instructions, present if
>>
>> CPUID Fn0000_00001_ECX[SSSE3]
>>
>> respective
>>
>> CPUID Fn0000_00001_ECX[AVX]
>>
>> are set. I think, both are not available on any
>> Phenom II, only family 15h manuals mention them
>> at the moment.
>
> I just stumbled over CPUID(1,ECX) read as 00802009h
> but CPUID (April 2010) tells that only instructions
> listed in APM4 are available.
Download AMD's latest manuals. They added a few
documents, e.g. the latest CPUID updates and an
optimization guide for Bulldozer/Zambezi. Chipset
manuals still end with 7xx (there are MoBos
with 8xx chipsets on the market; 9xx is ready to
be launched together with Zambezi).
Looking at latencies in the optimization guide,
the new processors are much slower than the old
Phenom II series, e.g. memory moves need 5 (vs.
3) clocks now. Let's wait and see how Zambezi
performs under real-life conditions.
>> Flying through the latencies for
>> the new processor generation, I found an inter-
>> esting entry - PUSH and POP are executed in one
>> clock cycle on family 15h processors. This is a
>> 500 percent speed advantage compared to trivial
>> REG,MEM or MEM,REG moves, requiring five cycles
>> (same processor!). If this is not a typo, I was
>> interested how they managed to make a processor
>> execute 2 dependent operations (load/store REG,
>> then update RSP) simultaneously within a single
>> clock cycle, but a mundane memory move consumes
>> five clock cycles while it performs less work?
>
> On chip cache can be as fast as the CPU itself, while
> external memory access needs to fiddle with addressing,
> cache-checks, paging and not at least race for the
> anyway slower RAM Bus.
??? A program's stack always resides in regular
memory. I guess, they just implemented some re-
gister renaming tricks to hide the required up-
date of rSP. Whenever the scheduler 'sees' PUSH
or POP, the old rSP is copied to a spare regis-
ter, and the original rSP is updated before the
PUSH or POP is executed. Still makes me wonder,
where those five cycles for the required memory
move (including all load/store mechanisms) dis-
appear to...
>> Let's hope, AMD does not end like Intel and be-
>> gins to count 0.25 clocks on a per pipe base...
>
> I'm sure that not even CPU-designers can tell any details
> on instruction timing anymore :)
At least _they_ should! ;)
> Now let's go egg tapping and follow it up with a beer ...
I'd rather have an espresso where you can taste
every single little bean individually.