Faster HexToBuffer Routines

spam...@crayne.org

Aug 31, 2006, 7:08:13 PM

Hi All,
I am currently in the process of refactoring several routines in the
HLA Standard Library. This is mainly a clean-up exercise I'm doing in
preparation for porting the library to FreeBSD. However, while I'm at
it, I'm also looking for ways to speed up some of the routines in the
library that were not written with performance in mind. The routine
I'm currently working on is one that should be near and dear to most
people's hearts -- an integer to hexadecimal representation conversion
routine. In particular, I'm posting the 64-bit (qword) version here,
though the 32-bit and 128-bit versions are quite similar (the 8-bit and
16-bit versions do a zero extend and call the 32-bit version, so they
don't count).

The following two routines are *internal* routines to the library. That
is, users typically do not call these routines directly; they typically
call a routine like stdout.putdSize or str.putdSize, and those library
routines call these internal routines (there's nothing stopping anyone
from calling these routines directly, but they make certain assumptions
about parameters passed into them without checking for errors, so
they're not good candidates for public routines).

The first routine in question is _hexToBuf64Size which has the
following prototype:

procedure conv._hexToBuf64Size
(
q :qword;
width :dword;
numWidth :uns32;
fill :char;
var buffer :var;
maxBufSize :dword
);

q is the qword value to convert to hexadecimal format.

width is the minimum field width in which to print the value (including
padding if this width is greater than the number of print positions the
number requires); if this number is negative, the hex digits are left
justified in the field, else they are right justified in the field.

numWidth is the *exact* number of characters the number requires, not
including padding but including underscores, if appropriate.

fill is the padding character to use if abs(width) is greater than
numWidth.

buffer is where the routine will store the resultant characters. It
should contain at least as many bytes as max( abs(width), numWidth ).

maxBufSize is the number of bytes allocated in the buffer.

The second routine (which _hexToBuf64Size calls) is _hexToBuf64 and it
has the following prototype:

procedure conv._hexToBuf64
(
q :qword;
width :dword;
var buffer :char in edi
);

The parameters have the same meaning as for _hexToBuf64Size.

The _hexToBuf64Size routine will raise an exception if it attempts to
write more than maxBufSize bytes to the buffer. It will also raise an
exception if width is greater than 1023 (this is an arbitrary, but
reasonable limitation). It will also raise an exception if numWidth is
an inappropriate value (either too large, or one of the "magic" numbers
that cannot be specified when outputting values with underscores in
them).
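
To make the width/numWidth/fill semantics concrete, here is a minimal C
sketch of the padding and justification step only. It is not the HLA
implementation: pad_field and its return convention are hypothetical, it
takes digits that have already been converted, and it returns -1 where
the library routine would raise an exception.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy num_width already-converted characters from
   digits into out, padded with fill to max(|width|, num_width) positions.
   A negative width left-justifies; otherwise the digits are
   right-justified. Returns the number of characters written (no
   terminator is stored), or -1 on a bad argument. */
int pad_field(const char *digits, int num_width, int width, char fill,
              char *out, size_t out_size)
{
    /* The 1023 limit mirrors the arbitrary limit described in the post. */
    if (num_width < 0 || width > 1023 || width < -1023)
        return -1;

    int field = width < 0 ? -width : width;
    if (field < num_width)
        field = num_width;              /* never truncate the number */
    if ((size_t)field > out_size)
        return -1;                      /* would overflow the buffer */

    int pad = field - num_width;
    if (width < 0) {                    /* left-justify: digits, then pad */
        memcpy(out, digits, (size_t)num_width);
        memset(out + num_width, fill, (size_t)pad);
    } else {                            /* right-justify: pad, then digits */
        memset(out, fill, (size_t)pad);
        memcpy(out + pad, digits, (size_t)num_width);
    }
    return field;
}
```

Calling pad_field("1F", 2, 5, '.', buf, sizeof buf) fills buf with
"...1F", while a width of -5 gives "1F...".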

Note that the lower-level conversion routine that _hexToBuf64Size calls
(_hexToBuf64) checks a global symbol (global with respect to the HLA
stdlib) called OutputUnderscores and if true, will emit an underscore
between each block of four digits (to the right) found in the number.
If OutputUnderscores is true, then numWidth must accurately reflect the
number of print positions INCLUDING THE UNDERSCORES. This means that
certain values of numWidth are illegal if OutputUnderscores is true. In
particular, 5, 10, and 15 are illegal values (in addition to zero and
all values greater than 19), matching the badWidth entries in the
htb64hasUS jump table in the code. This is because once you cross the
four digit (or eight digit or 12 digit) barrier, you add two character
positions--one for the digit and one for the underscore.
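
The arithmetic behind the illegal widths can be checked with a short C
sketch. The function names are hypothetical and the code only models the
counting rule (one underscore per completed group of four digits), not
the library routine:

```c
/* Print positions needed for `digits` hex digits (1..16) when an
   underscore separates each group of four digits. */
int width_with_underscores(int digits)
{
    return digits + (digits - 1) / 4;
}

/* Nonzero if some digit count in 1..16 produces exactly num_width
   print positions; zero for widths that can never occur. */
int is_legal_num_width(int num_width)
{
    for (int d = 1; d <= 16; d++)
        if (width_with_underscores(d) == num_width)
            return 1;
    return 0;
}
```

Under this rule the reachable widths are 1-4, 6-9, 11-14 and 16-19, so
0, 5, 10, 15 and anything above 19 are rejected, which lines up with the
badWidth slots in the htb64hasUS table.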

The actual hexToBuf64 routine uses a very simple straight-line code
algorithm (with a jump table) to emit the digits (up to 16 of them).
There are two code paths, one for OutputUnderscores true and one for
false, so that no extra testing is needed. This is an example of where
I'm willing to sacrifice a little space for a little speed. I also use
a lookup table to convert values in the range 0..$F to their
corresponding hexadecimal characters. Yeah, lookup tables can be slow
if they're not in the cache, but if the table is not in the cache then
this routine isn't being called very often and performance won't really
matter anyway. OTOH, when converting a lot of digits or when this
routine is being called frequently, the values will be in the cache.
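
For readers who do not read HLA, the same idea (jump to an entry point
chosen by the digit count, then fall through straight-line stores that
index a 16-entry table) can be sketched in C with a switch. This is an
illustration under assumptions, not a translation of the posted routine:
hex_to_buf64 and hex_digits are made-up names, upper-case digits are
assumed, and the underscore path is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed digit table; the library's own table may use a different case. */
static const char hex_digits[] = "0123456789ABCDEF";

/* Store the low n hex digits (1..16) of v so that the last digit lands
   at *end. Returns a pointer to the first digit stored, or NULL when n
   is out of range. No terminator is written. */
char *hex_to_buf64(uint64_t v, int n, char *end)
{
    switch (n) {   /* enter at the first digit wanted, then fall through */
    case 16: end[-15] = hex_digits[(v >> 60) & 0xF]; /* fall through */
    case 15: end[-14] = hex_digits[(v >> 56) & 0xF]; /* fall through */
    case 14: end[-13] = hex_digits[(v >> 52) & 0xF]; /* fall through */
    case 13: end[-12] = hex_digits[(v >> 48) & 0xF]; /* fall through */
    case 12: end[-11] = hex_digits[(v >> 44) & 0xF]; /* fall through */
    case 11: end[-10] = hex_digits[(v >> 40) & 0xF]; /* fall through */
    case 10: end[-9]  = hex_digits[(v >> 36) & 0xF]; /* fall through */
    case 9:  end[-8]  = hex_digits[(v >> 32) & 0xF]; /* fall through */
    case 8:  end[-7]  = hex_digits[(v >> 28) & 0xF]; /* fall through */
    case 7:  end[-6]  = hex_digits[(v >> 24) & 0xF]; /* fall through */
    case 6:  end[-5]  = hex_digits[(v >> 20) & 0xF]; /* fall through */
    case 5:  end[-4]  = hex_digits[(v >> 16) & 0xF]; /* fall through */
    case 4:  end[-3]  = hex_digits[(v >> 12) & 0xF]; /* fall through */
    case 3:  end[-2]  = hex_digits[(v >> 8)  & 0xF]; /* fall through */
    case 2:  end[-1]  = hex_digits[(v >> 4)  & 0xF]; /* fall through */
    case 1:  end[0]   = hex_digits[v & 0xF];
        return end - (n - 1);
    default:
        return NULL;                    /* bad digit count */
    }
}
```

A compiler will typically turn the dense switch into a jump table, so
the shape is close to the hand-written version: one indirect jump, then
no loop counter and no per-digit branch.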

One concern I do have is that this code uses the SHR( n, reg32 );
instruction quite frequently and performance of this instruction kind
of stinks on PIV, IIRC. It would be interesting to get opinions on the
instruction mix choice for other CPUs and whether some other
alternative would be better.

I haven't worried about instruction scheduling or anything like that
because most rules of this type apply only to certain processors. I
want a nice generic set of routines that work well across all x86 CPUs.

The code follows.
Cheers,
Randy Hyde

// I, Randall Hyde, hereby agree to waive all claim of copyright (economic
// and moral) in all content contributed by me, the user, and immediately
// place any and all contributions by me into the public domain; I grant
// anyone the right to use my work for any purpose, without any
// conditions, to be changed or destroyed in any manner whatsoever
// without any attribution or notice to the creator. I also absolve myself
// of any responsibility for the use of this code, the user assumes all
// responsibilities for using this software in an appropriate manner.
//
// Notice of declaration of public domain, 8/17/2006, by Randall Hyde

unit ConvUnit;

#include( "../include/conversions.hhf" )
#include( "stdlibdata.hhf" )

/****************************************************************************/
/*                                                                          */
/* _hexToBuf64-                                                             */
/*                                                                          */
/* On entry:                                                                */
/*   EDX:EAX contains a numeric value to convert to a hexadecimal string.   */
/*   ECX contains the number of digits to print (from L.O.->H.O.) but does  */
/*   not include a count for the underscore if one is to be inserted.       */
/*                                                                          */
/*   EDI points at the end of a memory buffer large enough to hold a        */
/*   64-bit hexadecimal value (at least 16 if OutputUnderscores is false,   */
/*   at least 19 bytes if OutputUnderscores is true).                       */
/*                                                                          */
/* On exit:                                                                 */
/*   The buffer will contain a zero-terminated string that is the           */
/*   hexadecimal representation of the value and EDI will point at the      */
/*   start of the string. If OutputUnderscores is true this routine will    */
/*   emit an underscore between groups of four hexadecimal digits.          */
/*                                                                          */
/****************************************************************************/

procedure conv._hexToBuf64
(
q :qword;
width :dword;
var buffer :char in edi
);
@noframe;
@nodisplay;
@noalignstack;

var
esiSave :dword; // These are organized so that the MOV
edxSave :dword; // instructions below access these
ecxSave :dword; // variables from lowest address to
ebxSave :dword; // highest address (better for cache).
eaxSave :dword;

readonly
htb64noUS :dword[17] :=
[
&badWidth,
&noUS1,
&noUS2,
&noUS3,
&noUS4,
&noUS5,
&noUS6,
&noUS7,
&noUS8,
&noUS9,
&noUS10,
&noUS11,
&noUS12,
&noUS13,
&noUS14,
&noUS15,
&noUS16
];

htb64hasUS :dword[20] :=
[
&badWidth,
&hasUS1,
&hasUS2,
&hasUS3,
&hasUS4,
&badWidth,
&hasUS6,
&hasUS7,
&hasUS8,
&hasUS9,
&badWidth,
&hasUS11,
&hasUS12,
&hasUS13,
&hasUS14,
&badWidth,
&hasUS16,
&hasUS17,
&hasUS18,
&hasUS19
];

#macro emitXDigit( src, digit, posn );

mov( src, ebx );
shr( digit*4, ebx );
and( $f, ebx ); // Strip out unwanted bits.
mov( stdlib.HexDigits[ebx], dl ); // Convert digit to hex char.
mov( dl, [edi-posn] );

#endmacro

begin _hexToBuf64;

push( ebp );
mov( esp, ebp );
sub( _vars_, esp );
mov( eax, eaxSave ); // Intel recommends MOVs rather
mov( ebx, ebxSave ); // than pushes and pops.
mov( ecx, ecxSave );
mov( edx, edxSave );
mov( esi, esiSave );

mov( width, ecx );
mov( (type dword q), eax ); // ESI:EAX is the number to convert
mov( (type dword q[4]), esi );

// Handle output with underscores later in this file:

cmp( OutputUnderscores, 0 );
jne DoUnderscores;

// Drop down here if we're not outputting underscores between groups of
// four digits in the number.
//
// Max width is 16 character positions:

cmp( ecx, 16 );
ja badWidth;

// Jump to one of the following labels based on the
// output size of the number:

jmp( htb64noUS[ ecx*4 ] );

noUS16:
emitXDigit(esi,7,15);
noUS15:
emitXDigit(esi,6,14);
noUS14:
emitXDigit(esi,5,13);
noUS13:
emitXDigit(esi,4,12);
noUS12:
emitXDigit(esi,3,11);
noUS11:
emitXDigit(esi,2,10);
noUS10:
emitXDigit(esi,1,9);

noUS9:
and( $f, esi ); // Strip out unwanted bits.
mov( stdlib.HexDigits[esi], dl ); // Convert digit to hex char.
mov( dl, [edi-8] );

noUS8:
emitXDigit(eax,7,7);
noUS7:
emitXDigit(eax,6,6);
noUS6:
emitXDigit(eax,5,5);
noUS5:
emitXDigit(eax,4,4);
noUS4:
emitXDigit(eax,3,3);
noUS3:
emitXDigit(eax,2,2);
noUS2:
emitXDigit(eax,1,1);
noUS1:
and( $f, eax ); // Strip out unwanted bits.
mov( stdlib.HexDigits[eax], dl ); // Convert digit to hex char.
mov( dl, [edi] );

sub( ecx, edi ); // Point EDI one byte before the first char.
jmp htbDone;

// Version of the above code that emits underscores between
// every four digits. Yep, repeated code (ugly), but done
// because the underscore processing is slower.

DoUnderscores:

// Drop down here if we're outputting underscores between groups of
// four digits in the number.
//
// Max width is 19 character positions:

cmp( ecx, 19 );
ja badWidth;

// Jump to one of the following labels based on the
// output size of the number:

jmp( htb64hasUS[ ecx*4 ] );

hasUS19:
emitXDigit(esi,7,18);
hasUS18:
emitXDigit(esi,6,17);
hasUS17:
emitXDigit(esi,5,16);
hasUS16:
emitXDigit(esi,4,15);

mov( '_', (type char [edi-14]));

hasUS14:
emitXDigit(esi,3,13);
hasUS13:
emitXDigit(esi,2,12);
hasUS12:
emitXDigit(esi,1,11);
hasUS11:
and( $f, esi ); // Strip out unwanted bits.
mov( stdlib.HexDigits[esi], dl ); // Convert digit to hex char.
mov( dl, [edi-10] );

mov( '_', (type char [edi-9]));

hasUS9:
emitXDigit(eax,7,8);
hasUS8:
emitXDigit(eax,6,7);
hasUS7:
emitXDigit(eax,5,6);
hasUS6:
emitXDigit(eax,4,5);

mov( '_', (type char [edi-4]));

hasUS4:
emitXDigit(eax,3,3);
hasUS3:
emitXDigit(eax,2,2);
hasUS2:
emitXDigit(eax,1,1);
hasUS1:
and( $f, eax ); // Strip out unwanted bits.
mov( stdlib.HexDigits[eax], dl ); // Convert digit to hex char.
mov( dl, [edi] );

sub( ecx, edi ); // Point EDI one byte before the first char.

htbDone:
add( 1, edi ); // Point back at first char in buffer.
mov( eaxSave, eax );
mov( ebxSave, ebx );
mov( ecxSave, ecx );
mov( edxSave, edx );
mov( esiSave, esi );
leave();
ret( _parms_ );

badWidth:
raise( ex.WidthTooBig );

end _hexToBuf64;

end ConvUnit;

// I, Randall Hyde, hereby agree to waive all claim of copyright (economic
// and moral) in all content contributed by me, the user, and immediately
// place any and all contributions by me into the public domain; I grant
// anyone the right to use my work for any purpose, without any
// conditions, to be changed or destroyed in any manner whatsoever
// without any attribution or notice to the creator. I also absolve myself
// of any responsibility for the use of this code, the user assumes all
// responsibilities for using this software in an appropriate manner.
//
// Notice of declaration of public domain, 8/17/2006, by Randall Hyde

unit ConvUnit;

#include( "../include/conversions.hhf" )
#include( "stdlibdata.hhf" )

/****************************************************************************/
/*                                                                          */
/* _hexToBuf64Size:                                                         */
/*                                                                          */
/* Converts a qword value to a hexadecimal string and places that           */
/* string in a buffer. Automatically adds padding to fill the buffer        */
/* with the specified fill character.                                       */
/*                                                                          */
/* q-                                                                       */
/*   Value to convert to a hexadecimal string                               */
/*                                                                          */
/* width-                                                                   */
/*   Minimum number of print positions to use. If this number is negative,  */
/*   the value is left-justified in the padded field. If this number is     */
/*   positive, the output is right-justified in the field. If the           */
/*   absolute value of this number is less than numWidth, then numWidth's   */
/*   value is used.                                                         */
/*                                                                          */
/* numWidth-                                                                */
/*   Actual number of print positions the number (plus underscores,         */
/*   if activated) will consume.                                            */
/*                                                                          */
/* fill-                                                                    */
/*   The padding character to use.                                          */
/*                                                                          */
/* buffer-                                                                  */
/*   Pointer to the buffer where this routine will store its                */
/*   resulting string. Note that the caller must ensure that                */
/*   the buffer is large enough to hold the converted string and            */
/*   any padding characters.                                                */
/*                                                                          */
/* maxBufSize-                                                              */
/*   Size of the destination buffer.                                        */
/*                                                                          */
/* Returns:                                                                 */
/*                                                                          */
/* EDI-                                                                     */
/*   Points at the start of the converted characters in the buffer.         */
/*                                                                          */
/* ECX-                                                                     */
/*   Contains the number of characters found in the buffer.                 */
/*                                                                          */
/****************************************************************************/

procedure conv._hexToBuf64Size
(
q :qword;
width :dword;
numWidth :uns32;
fill :char;
var buffer :var;
maxBufSize :dword
);
@noframe;
@nodisplay;

begin _hexToBuf64Size;

push( ebp );
mov( esp, ebp );
push( eax );
push( ebx );

mov( maxBufSize, edi );

// If width is negative, we left justify the string in the buffer,
// else we right justify the string in the buffer.

cmp( width, 0 );
jl leftJustify;

// Use the larger of the user-specified width and actual width
// for the number:

mov( numWidth, ebx );
cmp( ebx, width );
jae haveLargestWidth;

mov( width, ebx );

haveLargestWidth:
mov( ebx, width );

// Limit the width to maxBufSize characters:

cmp( ebx, edi );
jae wtb;

// Compute the number of leading pad characters we'll need:

sub( numWidth, ebx );

// Convert the hex value to a string and put it at
// the end of the buffer:

add( buffer, edi );
push( edi ); // Save, to compute string length later.
sub( 1, edi );
conv._hexToBuf64( q, numWidth, [edi] );

// Fill in the required number of characters before
// the string to pad it appropriately with the fill char.

test( ebx, ebx );
jz noPad;
mov( fill, al );
padLoop:
sub( 1, edi );
mov( al, [edi] );
sub( 1, ebx );
jnz padLoop;

noPad:
pop( ecx );
sub( edi, ecx ); // Compute string size
jmp hexBufSizeDone;

leftJustify:

neg( width ); // Compute abs(width)

// Use the larger of the user-specified width and actual width
// for the number:

mov( numWidth, ebx );
cmp( ebx, width );
jae _haveLargestWidth;

mov( width, ebx );

_haveLargestWidth:
mov( ebx, width );

// Limit the width to maxBufSize characters:

cmp( ebx, edi );
jae wtb;

// Compute the number of trailing pad characters we'll need:

add( buffer, edi );
push( edi );

sub( numWidth, ebx );
jz _noPad;
mov( fill, al );
_padLoop:
sub( 1, edi );
mov( al, [edi] );
sub( 1, ebx );
jnz _padLoop;

_noPad:

// Convert the hex value to a string and put it at
// just before the padding characters:

sub( 1, edi );
conv._hexToBuf64( q, numWidth, [edi] );

pop( ecx ); // Compute the length of the
sub( edi, ecx ); // resultant string.

hexBufSizeDone:
pop( ebx );
pop( eax );
pop( ebp );
ret( _parms_ );

wtb:
raise( ex.WidthTooBig );

end _hexToBuf64Size;

end ConvUnit;

Herbert Kleebauer

Sep 1, 2006, 2:33:17 PM

"rand...@earthlink.net" wrote:

> I am currently in the process of refactoring several routines in the
> HLA Standard Library. This is mainly a clean-up exercise I'm doing in
> preparation for porting the library to FreeBSD. However, while I'm at

Now I understand why you think an assembly beginner can't write
his own int2hex routine, 500 lines of code!!!

> it, I'm also looking for ways to speed up some of the routines in the
> library that were not written with performance in mind. The routine

Why does this routine have to be speed optimized? Isn't HLA designed as
a tool for learning/teaching assembly programming? Then you should
optimize the code for readability and not for speed.

> routine. In particular, I'm posting the 64-bit (qword) version here,

What sense does a 64-bit version make on a 32-bit processor? If you have
to load the 64-bit number into two registers before calling the function,
then you can just as well call the 32-bit version twice instead.

> width is the minimum field width in which to print the value (including

Because hex numbers are printed with leading zeros, the length of
the string is obvious and therefore a "field width" doesn't make much sense
(this would be different for decimal output with leading zeros suppressed).

> buffer is where the routine will store the resultant characters. It
> should contain at least as many bytes as max( abs(width), numWidth ).
>
> maxBufSize is the number of bytes allocated in the buffer.

> The _hexToBuf64Size routine will raise an exception if it attempts to

> write more than maxBufSize bytes to the buffer. It will also raise an

The calling code is responsible for providing an appropriate buffer; why
waste time in the function testing for a buffer overflow?

> The actual hexToBuf64 routine uses a very simple straight-line code
> algorithm (with a jump table) to emit the digits (up to 16 of them).

A very simple straight-line code algorithm (which any beginner
immediately understands) would be:

; r0 value to convert (8, 16 or 32 bit)
; r1 pointer to buffer (r1 updated, no other register modified)
out_hex_l: ror.l #16,r0
bsr.l out_hex_w
ror.l #16,r0
out_hex_w: ror.w #8,r0
bsr.l out_hex_b
ror.w #8,r0
out_hex_b: ror.b #4,r0
bsr.l out_hex_n
ror.b #4,r0
out_hex_n: move.l r0,-(sp)
andq.l #$0f,r0
move.b _buf(r0),r0
move.b r0,(r1)
addq.l #1,r1
move.l (sp)+,r0
rts.l
_buf: dc.b "0123456789abcdef"

It can convert 8, 16 and 32 bit values to hex. And if you want
an underscore after each byte, insert it in the out_hex_b code.

Rod Pemberton

Sep 1, 2006, 4:06:26 PM

"rand...@earthlink.net" <spam...@crayne.org> wrote in message
news:1157065693.1...@i42g2000cwa.googlegroups.com...
> Hi All,
<snip>

> One concern I do have is that this code uses the SHR( n, reg32 );
> instruction quite frequently and performance of this instruction kind
> of stinks on PIV, IIRC. It would be interesting to get opinions on the
> instruction mix choice for other CPUs and whether some other
> alternative would be better.
>

Instead of a lot of shifts, I usually use two registers:

dohex32: mov ecx,eax
and eax,00F0F0F0Fh
or eax,030303030h
shr ecx,4
and ecx,00F0F0F0Fh
or ecx,030303030h

call next0
shr eax,16
shr ecx,16
call next0
ret

next0: cmp ch, 03Ah
jl short next1
add ch, 007h

next1: cmp ah, 03Ah
jl short next2
add ah, 007h

next2: cmp cl, 03Ah
jl short next3
add cl, 007h

next3: cmp al, 03Ah
jl short next4
add al, 007h

next4: ; display routine

Of course, there is plenty of room for improvement. A small
lookup/translation table could eliminate the cmp/jmp/add...
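
The table-based variation mentioned above might look like the following
C sketch. It keeps the two-register nibble split and replaces the
compare/branch/add fix-up with an indexed load. Upper-case output and no
terminator are assumptions; hex32_split and xlat are illustrative names,
not part of any posted code.

```c
#include <stdint.h>

/* Convert a 32-bit value to 8 hex characters, most significant digit
   first. No terminator is written. */
void hex32_split(uint32_t v, char out[8])
{
    static const char xlat[] = "0123456789ABCDEF";

    uint32_t lo = v & 0x0F0F0F0Fu;          /* low nibble of each byte  */
    uint32_t hi = (v >> 4) & 0x0F0F0F0Fu;   /* high nibble of each byte */

    for (int i = 0; i < 4; i++) {
        int shift = 24 - 8 * i;             /* walk bytes from the top  */
        out[2 * i]     = xlat[(hi >> shift) & 0xFF];
        out[2 * i + 1] = xlat[(lo >> shift) & 0xFF];
    }
}
```

Each table index is a byte holding a value in 0..15, so the 16-entry
table is never read out of range.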

Rod Pemberton

Charles A. Crayne

Sep 1, 2006, 5:06:00 PM

On Fri, 01 Sep 2006 20:33:17 +0200
Herbert Kleebauer <spam...@crayne.org> wrote:

:A very simple straight-line code algorithm (which any beginner

:immediately understands) would be:
:
: ; r0 value to convert (8, 16 or 32 bit)
: ; r1 pointer to buffer (r1 updated, no other register modified)

As an additional example, I am still using essentially the same routine
which I wrote almost 25 years ago, albeit updated for 32-bit registers:

;hex to ascii routines
;eax value to be converted
;esi ->result string
;returns esi->next position
hexdd: push eax
shr eax,16 ;do high word first
call hexdw
pop eax
hexdw: push eax
shr eax,8 ;do high byte first
call hexdb
pop eax
hexdb: push eax
shr eax,4 ;do high nibble first
call hexdn
pop eax
hexdn: and eax,0fh ;isolate nibble
add al,'0' ;convert to ascii
cmp al,'9' ;valid digit?
jbe hexdn1 ;yes
add al,7 ;use alpha range
hexdn1: mov [esi],al ;store result
inc esi ;next position
ret

spam...@crayne.org

Sep 6, 2006, 5:20:24 PM

Charles A. Crayne wrote:
> On Fri, 01 Sep 2006 20:33:17 +0200
> Herbert Kleebauer <spam...@crayne.org> wrote:
>
> :A very simple straight-line code algorithm (which any beginner
> :immediately understands) would be:
> :
> : ; r0 value to convert (8, 16 or 32 bit)
> : ; r1 pointer to buffer (r1 updated, no other register modified)
>
> As an additional example, I am still using essentially the same routine
> which I wrote almost 25 years ago, albeit updated for 32-bit registers:

Quite similar to the code I've got in the HLA stdlib v1.x. Code that
I've been using since about 1985 when I first started developing the
library code I eventually called "The UCR Standard Library for 80x86
Assembly Language Programmers" (which ultimately morphed into the HLA
stdlib). However, in my current project I'm questioning all the code I've
written to see if I can write it better. I'm more interested in
exploring the reasons for using assembly language (e.g., speed) rather
than "how easily can I knock this routine out." The easy approach is
great for code you write once as part of an app and toss out into the
world. For library code, that gets called over and over again, I feel I
should pay a little more attention to the code and see if there isn't a
better solution (better, in this particular case, I'm defining as
"faster").
Cheers,
Randy Hyde

spam...@crayne.org

Sep 6, 2006, 5:15:29 PM

Herbert Kleebauer wrote:
> "rand...@earthlink.net" wrote:
>
> > I am currently in the process of refactoring several routines in the
> > HLA Standard Library. This is mainly a clean-up exercise I'm doing in
> > preparation for porting the library to FreeBSD. However, while I'm at
>
> Now I understand why you think an assembly beginner can't write
> his own int2hex routine, 500 lines of code!!!

Actually, if you look at the original routines, they were simplistic as
you suggest. But one reason for updating the stdlib functions is that
they are black box code and should be written to be efficient. I happen
to have chosen *speed* as the thing I want to optimize rather than
space. This definitely isn't *beginner* code I'm writing here, but
that's a big advantage to calling the stdlib, you won't get "beginner
quality" code.

>
>
> > it, I'm also looking for ways to speed up some of the routines in the
> > library that were not written with performance in mind. The routine
>
> Why does this routine have to be speed optimized?

Because that's what people expect from assembly language.
Speed optimization is far more valuable to most people than space
optimization, given that we're talking *hundreds* of bytes here on a
machine that typically has more than 512 million of them. Alas, cycles
are not as freely obtainable on most modern machines.

> Isn't HLA designed as
> a tool for learning/teaching assembly programming?

Irrelevant.
And you're missing the point that the HLA stdlib can be called from
other assemblers as well. Indeed, one of the big shifts in the new
version of the stdlib (v2.0) is to provide more than lip service to
other assembler users. You can't count on all of them being beginners.
And even if they were, why should beginners have to suffer with slow
code?

> Then you should
> optimize the code for readability and not for speed.

Actually, I'm headed in the opposite direction. I'm *reducing* the
readability. One other change that I'm making in the v1.x -> v2.0
transition is that I'm eliminating all the HLL-like control structures
(except for procedure calls) that are present in the v1.x library. This
is being done explicitly because some people *do* look at the code as
an example of how to write assembly language code, and HLL control
structures (be they macro invocations, statements built into the
assembler, whatever) aren't good examples. The HLL code may
be more readable, but that's not what the code should present.

As for using dumbed-down algorithms that are easy to understand, just
note that the purpose of the HLA stdlib is not to teach people how to
implement watered-down algorithms in assembly language. It's a black
box that allows people to *ignore* how those things are implemented.
Yep, I'm making the code a bit more advanced than the typical beginning
programmer might want to read. So what? If this code were appearing in
AoA or some other (not advanced) assembly book, you might have a point.
That's not the case, however.

Of course, if I *did* continue to use the old code that was trivial but
slow, I'd be hearing complaints about how crappy the library code is
and how beginners ought to stay away from it :-)

>
>
> > routine. In particular, I'm posting the 64-bit (qword) version here,
>
> What sense does a 64-bit version make on a 32-bit processor? If you have
> to load the 64-bit number into two registers before calling the function,
> then you can just as well call the 32-bit version twice instead.

You might try reading the code. You'll quickly discover several things.
In particular, the OutputUnderscores variable doesn't allow such a
trivial implementation.
Furthermore, the 64-bit conversion can be done *faster* (marginally,
but faster nonetheless) when it knows it's outputting a complete 64-bit
number.
And finally, you might note that the routine does *not* pass the data in
registers. The 32-bit (and smaller) versions do, but not the 64-bit
and 128-bit versions.

>
>
> > width is the minimum field width in which to print the value (including
>
> Because hex numbers are printed with leading zeros, the length of
> the string is obvious and therefore a "field width" doesn't make much sense
> (this would be different for decimal output with leading zeros suppressed).

Who says hex numbers are printed with leading zeros? This is a
convention that certain output routines adhere to, but there certainly
isn't a requirement like this that I've ever seen. Indeed, one of the
*main* reasons I've rewritten the hexadecimal output routines is
*because* the original ones did just as you claim and I've seen cases
where that is *not* desired behavior. You will note, btw, that if you
*want* the leading zeros, they are easily achieved by setting the
number width and field width to the same value and supplying a fill
character of zero.

>
>
> > buffer is where the routine will store the resultant characters. It
> > should contain at least as many bytes as max( abs(width), numWidth ).
> >
> > maxBufSize is the number of bytes allocated in the buffer.
>
> > The _hexToBuf64Size routine will raise an exception if it attempts to
> > write more than maxBufSize bytes to the buffer. It will also raise an
>
> The calling code is responsible for providing an appropriate buffer; why
> waste time in the function testing for a buffer overflow?

Because the routine calling _hexToBuf64Size may *not* be the one
allocating the buffer. For example, the HLA stdlib conv.dToStr(s)
routine gets passed an HLA string variable whose data storage has been
preallocated by the caller. Now the dToStr routine could be the one to
check for buffer overflow, but that would be duplicated code (as every
routine that calls dToStr would have to do this). Duplicated code is
bad. Better to put the duplicated code in one routine.

>
>
> > The actual hexToBuf64 routine uses a very simple straight-line code
> > algorithm (with a jump table) to emit the digits (up to 16 of them).
>
>
> A very simple straight-line code algorithm (which any beginner
> immediately understands) would be:

And a much slower one, too.
You do realize the cost of all those call instructions, don't you?
The purpose of the code is not to provide an example for beginners.
They don't need to understand how it works in order to *use* it. They
just need to know how to call it. Later, when they get a bit more
advanced, they can figure out how it works. Ultimately, then, they
don't pay the price of using a hex to string conversion routine that
was written in a simplistic manner so they could understand it when
they were a beginner.

>
> ; r0 value to convert (8, 16 or 32 bit)
> ; r1 pointer to buffer (r1 updated, no other register modified)
> out_hex_l: ror.l #16,r0
> bsr.l out_hex_w
> ror.l #16,r0
> out_hex_w: ror.w #8,r0
> bsr.l out_hex_b
> ror.w #8,r0
> out_hex_b: ror.b #4,r0
> bsr.l out_hex_n
> ror.b #4,r0
> out_hex_n: move.l r0,-(sp)
> andq.l #$0f,r0
> move.b _buf(r0),r0
> move.b r0,(r1)
> addq.l #1,r1
> move.l (sp)+,r0
> rts.l
> _buf: dc.b "0123456789abcdef"
>
> It can convert 8, 16 and 32 bit values to hex. And if you want
> an underscore after each byte, insert it in the out_hex_b code.

Except, of course, it *always* pads the number with leading zeros
(which the specifications for _hexToBufXX don't require). And it's
quite a bit slower (as already noted).

Now *most* of the time, speed doesn't matter. After all, the majority
of the time the numeric value being converted to a string is being
written to a file or, worse, to a bit-mapped display, which is 1000s of
times slower than the conversion routine itself. But in those cases
where the conversion is strictly in-memory and speed really does
matter, it's nice to have a fast routine. The straight-line code I
wrote is about an order of magnitude faster than the recursive code
I've written in the past and several times faster than the example
you've given. For the few extra bytes my routine consumes, I can live
with the larger size for the faster code. When speed doesn't matter
(e.g., writing to a file), the size of the routine probably isn't going
to matter either. I can see in certain embedded systems where the size
would matter, but then, it's the engineer's responsibility to make
appropriate choices in such cases.

Cheers,
Randy Hyde

Betov

Sep 7, 2006, 3:41:55 AM

"rand...@earthlink.net" <spam...@crayne.org> wrote
news:1157577624.1...@m79g2000cwm.googlegroups.com:

> For library code, that gets called over and over again, I feel I
> should pay a little more attention to the code and see if there isn't a
> better solution (better, in this particular case, I'm defining as
> "faster")

Logical.

:]]]]]

For the real Assembly Programmers who may need speed, they
will simply NOT use a Lib Functionality, but, instead, will
tailor the one matching exactly to what they are doing.

Betov.

< http://rosasm.org >

Bertrand Augereau

Sep 15, 2006, 6:43:27 AM

Ok, this isn't totally the original stuff, but here is a dummy
implementation in MSVC/GCC intrinsics (sorry, I'm at work, I don't have
time for converting this to nice clean asm myself, but it is easy) of
DWORD to buffer you might like, with neither branching nor a lookup table.
Of course it assumes the buffer is properly aligned... and it doesn't
EMMS :)
I haven't timed it but it might be at least of a pedagogical value, I
guess :)

Cheers!

static const __m64 threshold = _mm_set1_pi8 (9);
static const __m64 addForNumeric = _mm_set1_pi8 ('0');
static const __m64 addForAlpha = _mm_set1_pi8 ('A' - 10);

void DwordToHex (DWORD v, char* buffer)
{

DWORD hi = _byteswap_ulong(v & 0x0F0F0F0F);
DWORD lo = _byteswap_ulong((v & 0xF0F0F0F0) >> 4);
__m64 lo64 = _mm_cvtsi32_si64 (lo);
__m64 hi64 = _mm_cvtsi32_si64 (hi);

__m64 unpacked = _mm_unpacklo_pi8 ( lo64, hi64);

__m64 unpackedNumeric = _mm_add_pi8(addForNumeric, unpacked);
__m64 unpackedAlpha = _mm_add_pi8(addForAlpha, unpacked);

__m64 mask = _mm_cmpgt_pi8 (unpacked, threshold);

__m64 resultAlpha = _mm_and_si64(mask, unpackedAlpha);
__m64 resultNumeric = _mm_andnot_si64(mask, unpackedNumeric);
__m64 result = _mm_or_si64(resultAlpha, resultNumeric);
*(__m64*)buffer = result;

}

MSVC 2003 gives this:

push ebp
mov ebp, esp
and esp, -8 ; fffffff8H

mov eax, DWORD PTR _v$[ebp]
mov edx, DWORD PTR _buffer$[ebp]
mov ecx, eax
shr eax, 4
and ecx, 252645135 ; 0f0f0f0fH
and eax, 252645135 ; 0f0f0f0fH
bswap ecx
bswap eax
movd mm0, ecx
movd mm1, eax
punpcklbw mm1, mm0
movq mm0, MMWORD PTR _threshold
movq mm2, mm1
pcmpgtb mm2, mm0
movq mm0, mm2
movq mm2, MMWORD PTR _addForNumeric
paddb mm2, mm1
movq mm3, mm0
pandn mm3, mm2
movq mm2, MMWORD PTR _addForAlpha
paddb mm2, mm1
pand mm0, mm2
por mm0, mm3
movq MMWORD PTR [edx], mm0

mov esp, ebp
pop ebp
ret 0

spam...@crayne.org

Sep 15, 2006, 8:17:12 PM

Bertrand Augereau wrote:
> I haven't timed it but it might be at least of a pedagogical value, I
> guess :)
>
> Cheers!
>

[ MMX solution snipped]

Yep, the MMX solution kicks butt.
However, I've been loath to use MMX stuff in the HLA stdlib for three
reasons -- (1) interference with the FPU (which can be avoided by going
with the equivalent SSE instructions), (2) the concern about being able
to run the resultant code on any CPU, and (3) preserving the state of
the machine (e.g., saving registers) is expensive when using MMX. This
is *definitely* one area where a "specific" solution would work to
one's advantage (specific, in this case, meaning you know it's going to
run on a CPU with the necessary hardware and that the MMX registers are
available for tinkering with).

In the future, I may (via conditional assembly) create an SSE-enabled
version of the HLA stdlib that treats a few of the SSE registers as
volatile across calls to the stdlib. Then someone can link in that
version if they don't mind being limited to CPUs that support the
particular SSE instructions I use.

Thanks for the code, though.
Cheers,
Randy Hyde
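
As a rough illustration of point (1), here is what the same routine might look like with SSE2 integer intrinsics, which work on the XMM registers and so need no EMMS. This is a sketch under assumptions and not code from the HLA stdlib: the names dword_to_hex_sse2 and swap32 are made up for the example, the SSE2 path is chosen by compiler-defined macros, and a plain scalar loop stands in on targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HEX_SKETCH_SSE2 1
#endif

/* Hypothetical helper: reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Hypothetical routine: writes 8 uppercase hex digits, no terminator. */
void dword_to_hex_sse2(uint32_t v, char *out)
{
    assert(out != 0);
#ifdef HEX_SKETCH_SSE2
    {
        /* One nibble per byte, swapped so the top digit comes first. */
        __m128i hi = _mm_cvtsi32_si128((int)swap32((v >> 4) & 0x0F0F0F0Fu));
        __m128i lo = _mm_cvtsi32_si128((int)swap32(v & 0x0F0F0F0Fu));
        __m128i nib = _mm_unpacklo_epi8(hi, lo);

        /* Mask of digits above 9, then select between the candidates. */
        __m128i mask = _mm_cmpgt_epi8(nib, _mm_set1_epi8(9));
        __m128i alpha = _mm_add_epi8(nib, _mm_set1_epi8('A' - 10));
        __m128i num = _mm_add_epi8(nib, _mm_set1_epi8('0'));
        __m128i res = _mm_or_si128(_mm_and_si128(mask, alpha),
                                   _mm_andnot_si128(mask, num));

        /* Store the low 8 bytes; this store has no alignment rule. */
        _mm_storel_epi64((__m128i *)out, res);
    }
#else
    {
        /* Scalar stand-in so the sketch still builds without SSE2. */
        int i;
        for (i = 0; i < 8; i++) {
            unsigned d = (v >> (28 - 4 * i)) & 0xFu;
            out[i] = (char)(d + (d > 9 ? 'A' - 10 : '0'));
        }
    }
#endif
}
```

The register-preservation and CPU-support concerns in points (2) and (3) still apply to this variant; it only sidesteps the MMX/FPU aliasing.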

Bertrand Augereau

Sep 15, 2006, 9:18:22 PM

Butts don't need that much kicking :) and I am aware of the pitfalls of
using SIMD for these tasks.
It was just an occasion to show how to avoid branching with por, pand
and pandn.
EMMS is expensive (though not that expensive) for just one conversion, but
its cost becomes negligible if you process a whole buffer and keep the
constants in registers.
And it's tedious to keep track of FPU/MMX register state in library code.
(Yet you'll be hard pressed to find a CPU that doesn't run MMX code now.
At work we sacrifice those and assume at least MMX and SSE1 for gaming
machines, even though MMX/FPU aliasing is tedious to get right).

Good luck with HLA 2,

Cheers,
Bertrand

spam...@crayne.org

Sep 20, 2006, 1:00:16 AM

Bertrand Augereau wrote:
> Ok, this isn't totally the original stuff, but here is a dummy
> implementation in MSVC/GCC intrinsics (sorry, I'm at work, I don't have
> time for converting this to nice clean asm myself, but it is easy) of
> DWORD to buffer you might like, without branching nor lookup table.
> Of course it assumes the buffer is properly aligned... and it doesn't
> EMMS :)
> I haven't timed it but it might be at least of a pedagogical value, I
> guess :)
>

Ouch!
The MMX version had *terrible* timing on my PIV system.
Here's the code I used:

program t;
#include( "stdlib.hhf" )

procedure _hexToBuf32
(
d :dword;
width :dword;
var buf :char in edi
);
@noframe;
@nodisplay;
@noalignstack;

readonly(16)

#macro x8(_v_):_i_, _result_;
?_result_ := 0;
#for( _i_ := 0 to 7 )

?_result_ := (_result_ << 8) | (_v_);

#endfor
_result_
#endmacro

_threshold :qword := x8( 9 );
_addForNumeric :qword := x8( uns8('0') );
_addForAlpha :qword := x8( uns8('A')-10 );

copyBytes :dword[9] :=
[
&badWidth,
&width1,
&width2,
&width3,
&width4,
&width5,
&width6,
&width7,
&width8
];

var
buffer :byte[16];

begin _hexToBuf32;

push( ebp );
mov( esp, ebp );
sub( _vars_, esp );

push( eax );
push( ecx );
push( edx );

mov( d, eax );
mov( eax, edx );
shr( 4, eax );
and( $0f0f_0f0f, edx );
and( $0f0f_0f0f, eax );
bswap( edx );
bswap( eax );
movd( edx, mm0 );
movd( eax, mm1 );
punpcklbw( mm0, mm1 );
movq( _threshold, mm0 );
movq( mm1, mm2 );
pcmpgtb( mm0, mm2 );
movq( mm2, mm0 );
movq( _addForNumeric, mm2 );
paddb( mm1, mm2 );
movq( mm0, mm3 );
pandn( mm2, mm3 );
movq( _addForAlpha, mm2 );
paddb( mm1, mm2 );
pand( mm2, mm0 );
por( mm3, mm0 );
movq( mm0, (type qword buffer) );
mov( width, ecx );
cmp( ecx, 8 );
ja badWidth;
jmp( copyBytes[ecx*4] );

width8:
mov( buffer[0], al );
mov( al, [edi-7] );

width7:
mov( buffer[1], al );
mov( al, [edi-6] );

width6:
mov( buffer[2], al );
mov( al, [edi-5] );

width5:
mov( buffer[3], al );
mov( al, [edi-4] );

width4:
mov( buffer[4], al );
mov( al, [edi-3] );

width3:
mov( buffer[5], al );
mov( al, [edi-2] );

width2:
mov( buffer[6], al );
mov( al, [edi-1] );

width1:
mov( buffer[7], al );
mov( al, [edi-0] );

htb32Done:
pop( edx );
pop( ecx );
pop( eax );
leave();
ret( _parms_ );

badWidth:
raise( ex.WidthTooBig );

end _hexToBuf32;

procedure __hexToBuf32
(
d :dword in eax;
width :dword in ecx;
var buffer :byte in edi
);
@noframe;
@nodisplay;
@noalignstack;

readonly
noUSjt :dword[9] :=

[
&badWidth,
&noUS1,
&noUS2,
&noUS3,
&noUS4,
&noUS5,
&noUS6,
&noUS7,
&noUS8

];

hexDigits :char[16] :=
[
'0', '1', '2', '3',
'4', '5', '6', '7',
'8', '9', 'a', 'b',
'c', 'd', 'e', 'f'
];

hexTbl :byte[512] :=
[
?i:byte := 0;
#while(i <= 254)
#if( i < $10 )
'0',
#else
char( @substr( string(byte(i)), 0, 1) ),
#endif
char( @substr( string(byte(i)), 1, 1) ),
?i := i + 1;
#endwhile
'F', 'F'
];

#macro emit2XDigits( shift, posn );

mov( eax, edx );
shr( shift, edx );
and( $ff, edx );
mov( (type word hexTbl[edx*2]), dx );
mov( dx, [edi-posn] );

#endmacro

begin __hexToBuf32;

push( eax );
push( ecx );
push( edx );

// If we're outputting underscores, then go handle the output
// using a special version of this routine. (Note: OutputUnderscores is
// a global object in the HLA Standard Library.)

cmp( ecx, 8 );
ja badWidth;

jmp( noUSjt[ecx*4] );

noUS8:
emit2XDigits( 24, 7);
noUS6:
emit2XDigits( 16, 5);
noUS4:
emit2XDigits( 8, 3 );
noUS2:
and( $ff, eax );
mov( (type word hexTbl[eax*2]), dx );
mov( dx, [edi-1] );
jmp htbDone;

noUS7:
emit2XDigits( 20, 6 );
noUS5:
emit2XDigits( 12, 4 );
noUS3:
emit2XDigits( 4, 2 );

noUS1:
and( $f, eax ); // Strip out unwanted bits.

mov( hexDigits[eax], dl ); // Convert digit to hex char.
mov( dl, [edi] );

htbDone:
sub( ecx, edi ); // Point EDI at the first char in the buffer
add( 1, edi );
pop( edx );
pop( ecx );
pop( eax );
ret();

badWidth:
raise( ex.WidthTooBig );

end __hexToBuf32;

var
start :dword[2];
time1 :dword[2];
time2 :dword[2];
buffer :byte[16];

begin t;

rdtsc();
mov( eax, start );
mov( edx, start[4] );
xor( eax, eax );
loopit1:

lea( edi, buffer[15] );
__hexToBuf32( eax, 8, [edi] );
add( 1, eax );
jnz loopit1;

rdtsc();
sub( start, eax );
sbb( start[4], edx );
mov( eax, time1 );
mov( edx, time1[4] );

rdtsc();
mov( eax, start );
mov( edx, start[4] );
xor( eax, eax );
loopit2:

lea( edi, buffer[15] );
_hexToBuf32
(
eax,
8,
[edi]
);
add( 1, eax );
jnz loopit2;

rdtsc();
sub( start, eax );
sbb( start[4], edx );
mov( eax, time2 );
mov( edx, time2[4] );
stdout.put( "Time1: " );
stdout.putd( time1[4] );
stdout.putd( time1[0] );
stdout.newln();
stdout.put( "Time2: " );
stdout.putd( time2[4] );
stdout.putd( time2[0] );
stdout.newln();

end t;

Here's the output:
Time1: 00000016249C5C48
Time2: 00000096C559D1DC

Granted, the MMX version *could* be sped up a little bit, but the
non-MMX version is about 6.8x faster! It would be interesting to see
what went wrong here.
Cheers,
Randy Hyde
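
To put the two totals on a per-call scale, the arithmetic can be sketched in C. This is a small worked example and not part of the benchmark: the function names are made up, and it relies on the fact that each timing loop counts EAX up from 0 until it wraps back to 0, that is 2^32 calls. The figures include loop and call overhead.

```c
#include <stdint.h>
#include <stdio.h>

/* Cycle totals copied from the benchmark output in this post. */
#define TIME1_CYCLES 0x00000016249C5C48ULL /* table-based __hexToBuf32 */
#define TIME2_CYCLES 0x00000096C559D1DCULL /* MMX-based _hexToBuf32    */

/* Each loop runs EAX from 0 until it wraps to 0 again: 2^32 calls. */
#define ITERATIONS 4294967296.0

/* Hypothetical helper: average cycles per call, overhead included. */
double cycles_per_call(uint64_t total_cycles)
{
    return (double)total_cycles / ITERATIONS;
}

/* Hypothetical helper: how many times slower the MMX loop ran. */
double mmx_slowdown(void)
{
    return (double)TIME2_CYCLES / (double)TIME1_CYCLES;
}

/* Print the three derived figures with one decimal place. */
void print_summary(void)
{
    printf("table: %.1f cycles/call\n", cycles_per_call(TIME1_CYCLES));
    printf("MMX:   %.1f cycles/call\n", cycles_per_call(TIME2_CYCLES));
    printf("ratio: %.1fx\n", mmx_slowdown());
}
```

This works out to roughly 22 cycles per call for the table-based routine against roughly 151 for the MMX one, a ratio of about 6.8.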

Bertrand Augereau

Sep 21, 2006, 3:23:47 AM

Very interesting! :)
Could you run the code through VTune?
Maybe it's because store-to-load forwarding fails on the buffer? (a qword
store followed by byte loads)

I might check today at work if I have time.

Cheers,
Bertrand

rand...@earthlink.net wrote: