Optimizing DCPCrypt Library (AES and Tiger) for newer processors/instruction sets.


Skybuck Flying

Mar 14, 2008, 1:44:14 PM
Hello,

My software uses DCPCrypt Library for Delphi.

Specifically the following algorithms are used:

AES (Advanced Encryption Standard)
Tiger (192 bit Hash)

I looked at the source code some time ago and it seems it simply uses int64s,
which the Delphi compiler implements as simulated 64-bit integers
(multi-word arithmetic), which is kinda good because this allows
backwards compatibility.

However, newer processors have special instruction sets like MMX, SSE, SSE2, etc.

I wonder if the library can be extended to include special classes which
could be used when a newer processor is detected.
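
The "use a special class when a newer processor is detected" idea could be sketched roughly like this in C++. This is only a minimal sketch: `cpu_has_sse2`, `TigerImpl` and `select_tiger_impl` are made-up names for illustration and are not part of DCPCrypt or Crypto++. It assumes an x86 target with either MSVC (`__cpuid` from `<intrin.h>`) or GCC/Clang (`__builtin_cpu_supports`), and reports "no SSE2" everywhere else.

```cpp
#include <cassert>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>   // __cpuid
#endif

// Returns true if the CPU reports SSE2 support; false on non-x86 targets.
bool cpu_has_sse2() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                    // CPUID leaf 1: feature flags
    return (regs[3] & (1 << 26)) != 0;   // EDX bit 26 = SSE2
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;                        // unknown target: stay on the generic path
#endif
}

// Hypothetical implementation choices; a real library would hand back a
// function pointer or a class instance instead of an enum value.
enum class TigerImpl { Generic, Sse2 };

// Picks the implementation once, based on the detection result.
TigerImpl select_tiger_impl(bool has_sse2) {
    return has_sse2 ? TigerImpl::Sse2 : TigerImpl::Generic;
}
```

Typical use would be a single call such as `select_tiger_impl(cpu_has_sse2())` at startup, so the generic code path stays available for older processors.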

So some general questions which I haven't looked into... but later I will
look into it.

For now I ask the people on these newsgroups:

1. Does AES benefit from the SSE instruction set ? Can performance be increased
without compromising security too much ? Especially ECB mode ;)

2. Does Tiger benefit from the SSE instruction set ? Can its performance be
increased ?

Thanks for any insight or ideas.

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 1:54:13 PM
I am gonna add one more question:

3. Would CRC32 calculations benefit from SSE or any other new instruction
set ?
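
As a baseline for such a comparison, here is a minimal table-driven CRC-32 sketch in C++ (reflected polynomial 0xEDB88320, as used by zip and Ethernet). The name `crc32_table_driven` is made up for this example. One thing to keep in mind: the `crc32` instruction that comes with SSE4.2 computes CRC-32C (the Castagnoli polynomial), so it is not a drop-in replacement for this particular checksum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Lookup table for the reflected CRC-32 polynomial 0xEDB88320,
// generated bit by bit when the table object is constructed.
struct Crc32Table {
    std::uint32_t entry[256];
    Crc32Table() {
        for (std::uint32_t i = 0; i < 256; ++i) {
            std::uint32_t c = i;
            for (int bit = 0; bit < 8; ++bit)
                c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
            entry[i] = c;
        }
    }
};

// Processes one byte per table lookup; initial value and final XOR
// are both 0xFFFFFFFF, as in the common zip/Ethernet CRC-32.
std::uint32_t crc32_table_driven(const void* data, std::size_t length) {
    static const Crc32Table table;
    const unsigned char* p = static_cast<const unsigned char*>(data);
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < length; ++i)
        crc = table.entry[(crc ^ p[i]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for this CRC is 0xCBF43926 for the nine ASCII bytes "123456789", which is a handy sanity test before benchmarking anything faster against it.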

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 2:41:12 PM
Well, so far I have found Crypto++ Library 5.5.2, open source, public domain
C++ stuff ;)

It has some assembler functions for Tiger and AES and such.

Few comments though.

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 2:50:42 PM
There is something I don't understand about this library:

It has this AS stuff everywhere like:

AS2( pxor xmm##x1, [inputPtr+p1*16])\
AS2( pxor xmm##x2, [inputPtr+p2*16])\
AS2( pxor xmm##x3, [inputPtr+p3*16])\
AS2( add inputPtr, increment*16)\
ASC( jmp, labelPrefix##3)\
ASL(labelPrefix##7)\
AS2( movdqu xmm##t, [inputPtr+p0*16])\
AS2( pxor xmm##x0, xmm##t)\
AS2( movdqu xmm##t, [inputPtr+p1*16])\
AS2( pxor xmm##x1, xmm##t)\
AS2( movdqu xmm##t, [inputPtr+p2*16])\
AS2( pxor xmm##x2, xmm##t)\

It seems to be some kind of macro ?

When I go to the definition it reads:

#define AS1(x) __asm {x}
#define AS2(x, y) __asm {x, y}
#define AS3(x, y, z) __asm {x, y, z}
#define ASS(x, y, a, b, c, d) __asm {x, y, _MM_SHUFFLE(a, b, c, d)}

What the hell does that do ???

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 2:53:29 PM
Oh wait, now I understand, it's a set of macros that simply adapts the asm
syntax to the compiler.

This is for when different compilers are used.

Apparently C compilers use different asm syntaxes.

It becomes clear when looking at the complete definitions.

Especially the GNUC stuff... it seems it requires the asm as quoted strings
with separators like ; and strange stuff or whatever ;) :)

#ifdef CRYPTOPP_GENERATE_X64_MASM
#define AS1(x) x*newline*
#define AS2(x, y) x, y*newline*
#define AS3(x, y, z) x, y, z*newline*
#define ASS(x, y, a, b, c, d) x, y, a*64+b*16+c*4+d*newline*
#define ASL(x) label##x:*newline*
#define ASJ(x, y, z) x label##y*newline*
#define ASC(x, y) x label##y*newline*
#define AS_HEX(y) y##h
#elif defined(__GNUC__)
// define these in two steps to allow arguments to be expanded
#define GNU_AS1(x) #x ";"
#define GNU_AS2(x, y) #x ", " #y ";"
#define GNU_AS3(x, y, z) #x ", " #y ", " #z ";"
#define GNU_ASL(x) "\n" #x ":"
#define GNU_ASJ(x, y, z) #x " " #y #z ";"
#define AS1(x) GNU_AS1(x)
#define AS2(x, y) GNU_AS2(x, y)
#define AS3(x, y, z) GNU_AS3(x, y, z)
#define ASS(x, y, a, b, c, d) #x ", " #y ", " #a "*64+" #b "*16+" #c "*4+" #d ";"
#define ASL(x) GNU_ASL(x)
#define ASJ(x, y, z) GNU_ASJ(x, y, z)
#define ASC(x, y) #x " " #y ";"
#define CRYPTOPP_NAKED
#define AS_HEX(y) 0x##y
#else


#define AS1(x) __asm {x}
#define AS2(x, y) __asm {x, y}
#define AS3(x, y, z) __asm {x, y, z}
#define ASS(x, y, a, b, c, d) __asm {x, y, _MM_SHUFFLE(a, b, c, d)}

#define ASL(x) __asm {label##x:}
#define ASJ(x, y, z) __asm {x label##y}
#define ASC(x, y) __asm {x label##y}
#define CRYPTOPP_NAKED __declspec(naked)
#define AS_HEX(y) 0x##y
#endif

Pretty smart of them ;) :)
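
The "define these in two steps" comment in the GNUC branch is about how the preprocessor's `#` operator works: it turns an argument into a string before any macros inside that argument are expanded. The small C++ sketch below shows the difference. All `DEMO_` names and `demo_shuffle` are made up for this illustration; they only mirror the shape of the `GNU_AS2`/`AS2` pair and the `a*64+b*16+c*4+d` arithmetic quoted above.

```cpp
#include <cassert>
#include <cstring>

#define DEMO_REG xmm3

// One step: # stringifies the raw argument, so DEMO_REG is NOT expanded.
#define DEMO_ONE_STEP_AS2(x, y) #x ", " #y ";"

// Two steps: the outer macro expands its arguments first, then the
// inner macro stringifies the already-expanded text.
#define DEMO_GNU_AS2(x, y) #x ", " #y ";"
#define DEMO_AS2(x, y) DEMO_GNU_AS2(x, y)

// Adjacent string literals are concatenated by the compiler, so each
// macro call ends up as one asm-style string.
const char* one_step() { return DEMO_ONE_STEP_AS2(pxor DEMO_REG, xmm1); }
const char* two_step() { return DEMO_AS2(pxor DEMO_REG, xmm1); }

// The shuffle immediate packs four 2-bit lane selectors into one byte,
// the same value as the a*64+b*16+c*4+d expression in the ASS macros.
constexpr int demo_shuffle(int a, int b, int c, int d) {
    return (a << 6) | (b << 4) | (c << 2) | d;
}
```

Here `one_step()` yields the string `pxor DEMO_REG, xmm1;` while `two_step()` yields `pxor xmm3, xmm1;`, which would explain why register names built from other macros need the extra level.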

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 3:04:52 PM
Hmm, for some reason it seems the Tiger stuff is not included in the DLL ? Or
is it just me ?

I might have to rip it out and test it separately in C++ before I
attempt a conversion to Delphi ;)

I would prefer to know its performance before I even attempt such a thing
;)

To see if it's worth it or not ;)

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 3:25:36 PM
The Cryptest.exe has many benchmarks in it.

It asks for the frequency of the CPU in gigahertz.

That's something I don't really understand.

Why does it ask for it ?

My guess is it probably uses it to calculate bytes per clock or something ?

Kinda strange that it uses such an unreliable source of information ?

(Don't trust the user?)

Oh well.

Also I am not sure what to enter for my dual core CPU.

Should I enter the GHz for a single core or added together for both ?

Also what does MiB exactly mean ?

Maybe millions of bytes per sec ?

I did a quick test with 1 second, and 2 GHz.

The test outputs the HTML markup for a table. I pasted it into FrontPage...
it seems to look ok... though at the end the table seemed screwed up...

I don't trust this table... what if the markup was messed up ? ;)

Anyway... I have 30 windows open or so... in the background they don't do
much... the CPU (X2 3800+) is in power saving mode... minimal power
management.

The speed for tiger was reported as:

106 MiB/sec.
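
For what it's worth, MiB is the binary unit mebibyte, 1024*1024 = 1,048,576 bytes. If the entered clock speed is only used to turn throughput into cycles per byte (that matches the guess above, but it is an assumption here, not something checked in the Crypto++ source), the arithmetic would look like the C++ sketch below. The function names are made up for this example. Assuming the benchmark hashes on a single thread, the clock of one core would be the number to enter, not the sum of both.

```cpp
#include <cassert>

// 1 MiB (mebibyte) = 2^20 bytes.
const double kBytesPerMiB = 1024.0 * 1024.0;

// Throughput in MiB/s from a byte count and an elapsed time in seconds.
double mib_per_second(double bytes, double seconds) {
    return bytes / seconds / kBytesPerMiB;
}

// Cycles spent per byte, given throughput in MiB/s and the clock in GHz.
double cycles_per_byte(double mib_per_sec, double clock_ghz) {
    return (clock_ghz * 1e9) / (mib_per_sec * kBytesPerMiB);
}
```

With the numbers from this test, `cycles_per_byte(106.0, 2.0)` comes out at roughly 18 cycles per byte. If the CPU was really clocked lower because of power saving, the true figure would be lower as well.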

Maybe later I will do a better/longer test... under better circumstances...

Or maybe I simply rip it out.

The first thing I am gonna do now... is investigate the benchmark code for
the Tiger section.

And then I will try to code something similar for the Delphi implementation of
Tiger and compare them against each other.

Maybe later I will try an x64 build of the Delphi code with the Free Pascal
cross compiler. Interesting to see if MMX is slower or faster than x64 ;)

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 3:27:04 PM
Yes, and I also give compliments to Visual Studio and the people who made
this library.

In the past C/C++ code frequently would not work when downloaded from the
internet.

But this works just fine !

Nice ! ;)

Bye,
Skybuck =D


Skybuck Flying

Mar 14, 2008, 3:34:46 PM
Also take note:

I had to move the *.exe to the main folder otherwise it can't find the
usage.dat thingies and such.

I will modify the path of the build so it goes to main folder.

I wanna set a breakpoint on the tiger hash to see exactly what it does.

The benchmark code seems a little bit sloppy if I found the correct one:

while (timeTaken < 2.0/3*timeTotal);

OutputResultBytes(name, double(blocks) * BUF_SIZE, timeTaken);

I don't understand the formula above... I think it's sloppily coded, and it
might make it more difficult to compare against other external benchmarks.

Bytes should always be divided by the total time taken for accurate
measurements.

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 3:56:03 PM
Ok, I was wondering about that, and now it turns out I spoke too soon.

It seems they do divide it by the time.

It happens in the output function:

double mbs = length / timeTaken / (1024*1024);

Finally, they use a little benchmarking trick to prevent spending excessive
CPU time in the "timing functions":

By multiplying the target number of blocks by 2 each pass, then looping
until i reaches that target, and only then reading the timer again.
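
A rough C++ sketch of that doubling trick, as I read it from the two quoted lines. `BenchResult` and `run_doubling_benchmark` are made-up names, and the clock is passed in as a parameter only so the loop can be tried with any timer; this is not the actual Crypto++ benchmark code.

```cpp
#include <cassert>

struct BenchResult {
    unsigned long blocks;   // total number of blocks processed
    double seconds;         // elapsed time when the loop stopped
};

// Doubles the block target on every pass and reads the timer only once
// per pass, so timer calls stay rare compared to the measured work.
// Stops once two thirds of the requested total time has elapsed.
template <class Work, class Now>
BenchResult run_doubling_benchmark(Work process_block, Now now_seconds,
                                   double time_total) {
    unsigned long blocks = 1;
    unsigned long done = 0;
    double taken = 0.0;
    const double start = now_seconds();
    do {
        blocks *= 2;                       // double the target each pass
        for (; done < blocks; ++done)
            process_block();               // the work being measured
        taken = now_seconds() - start;     // single timer read per pass
    } while (taken < 2.0 / 3 * time_total);
    BenchResult result = {blocks, taken};
    return result;
}
```

Throughput is then `blocks * block_size / seconds`. The 2/3 factor looks like a heuristic: the next pass would roughly double the elapsed time, so stopping at two thirds keeps the run from overshooting the requested total by much.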

The only thing I might not be able to reproduce is the random number
generator they used.

Maybe the filling of the buffer plays a role... I'm not sure...

Well, I could simply reproduce it... by outputting the buffer in text mode and
simply copying and pasting it... or maybe I need to investigate the RNG... for
now I will simply use Delphi's random number generator and perform a
benchmarking comparison... this will be interesting.

Of course their benchmark has a little C++ inheritance/template overhead...
but Delphi will have a little bit of inheritance overhead as well.

Still, if the asm is significantly faster, that shouldn't matter much.

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 4:40:59 PM
Hmmmmmmmmmmmmmmmmmmmmmmmmmm.

The C/C++/ASM Library (Crypto++) is quite impressive:

The tiger implementation achieves 112 megabyte per second on AMD X2 3800+.
(With lots of windows open in release mode)

The Delphi library (DCPCrypt 2) performs as follows:

The tiger implementation achieves 45.45 megabyte per second on AMD X2 3800+.
(With lots of windows open in release mode)

Compilers used:
Delphi 2007.
Visual Studio 2005. (Project default opened in VS 2005 while VS 2008 was
installed (?))

I am pretty sure this big difference is not because of other overhead
factors or so... I had a little bit of doubt about the high performance
timer I used... so I replaced it with the timer the C code uses, but it makes
no difference... also I inspected the ticks per second variable in the C
version and it was set to 1000. I don't know how they acquire it, nor do I
care at this point. I got what I wanted... some nice benchmarking numbers.

This means the C/C++/ASM version is roughly 2.5 times faster than the Delphi
version, which is very interesting.

Finally, I wonder how the Free Pascal compiler in x64 mode will perform.

So I will probably try that next... but no guarantees... maybe it won't
compile.

Here is the same kind of benchmark in Delphi as used in the C/C++ code:

It uses a slightly modified DCPCrypt Library... only the component parameter
for the create/constructor was removed ;)

// *** Begin of Benchmark Program ***


program Project1;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  Windows,
  DCPcrypt_version_201,
  DCPtiger_version_201;

var
  start_tics : int64;

(*

/***
*clock_t clock() - Return the processor time used by this process.
*
*Purpose:
* This routine calculates how much time the calling process
* has used. At startup time, startup calls __inittime which stores
* the initial time. The clock routine calculates the difference
* between the current time and the initial time.
*
* Clock must reference _cinitime so that _cinitim.asm gets linked in.
* That routine, in turn, puts __inittime in the startup initialization
* routine table.
*
*Entry:
* No parameters.
* itime is a static structure of type timeb.
*
*Exit:
* If successful, clock returns the number of CLK_TCKs (milliseconds)
* that have elapsed. If unsuccessful, clock returns -1.
*
*Exceptions:
* None.
*
*******************************************************************************/
*)

function clock : int64;
var
  current_tics : int64;
  ct : FILETIME;
begin
  GetSystemTimeAsFileTime( &ct );

  current_tics := int64(ct.dwLowDateTime) + (int64(ct.dwHighDateTime) shl 32);

  // calculate the elapsed number of 100 nanosecond units
  current_tics := current_tics - start_tics;

  // return number of elapsed milliseconds
  result := int64(current_tics div 10000);
end;

(*
/***
*int __inittime() - Initialize the time location
*
*Purpose:
* This routine stores the time of the process startup.
* It is only linked in if the user issues a clock runtime call.
*
*Entry:
* No arguments.
*
*Exit:
* Returns 0 to indicate no error.
*
*Exceptions:
* None.
*
*******************************************************************************/
*)

function inittime : integer;
var
  st : FILETIME;
begin
  GetSystemTimeAsFileTime( st );

  start_tics := int64(st.dwLowDateTime) + (int64(st.dwHighDateTime) shl 32);

  result := 0;
end;

procedure Main;
var
  vHash : TDCP_hash;
  vBuffer : packed array[0..2047] of byte; // same size as the cryptest.
  vBufferSize : integer;
  // vTick1 : int64;
  // vTick2 : int64;
  vStart : int64;
  vTicksPerSecond : int64;
  vIndex : integer;
  vBlocks : integer;

  vTimeTaken : double;
  vTimeTotal : double;
begin
  vBufferSize := 2048;

  QueryPerformanceFrequency( vTicksPerSecond );

  vHash := TDCP_Tiger.Create;
  vHash.Init;

  vIndex := 0;
  vBlocks := 1;
  vTimeTotal := 10;

  // QueryPerformanceCounter(vTick1);

  // Clock returns milliseconds, so this overrides the frequency queried above
  vTicksPerSecond := 1000;

  vStart := Clock;

  repeat
    // double the target number of blocks each pass
    vBlocks := vBlocks * 2;
    while vIndex < vBlocks do
    begin
      vHash.Update( vBuffer, vBufferSize );
      vIndex := vIndex + 1;
    end;

    // QueryPerformanceCounter(vTick2);
    // vTimeTaken := (vTick2 - vTick1) / vTicksPerSecond;

    vTimeTaken := (Clock - vStart) / vTicksPerSecond;
  until not (vTimeTaken < 2.0 / 3 * vTimeTotal);

  vHash.Free;

  // Int64 cast keeps the byte count from overflowing 32 bits on fast machines
  writeln('MiB/Sec: ', ( ( (Int64(vBlocks) * vBufferSize) / vTimeTaken ) /
    (1024*1024) ) :16:2 );
end;

begin
  try
    inittime;
    Main;
  except
    on E:Exception do
      Writeln(E.Classname, ': ', E.Message);
  end;
  readln;
end.

// *** End of Benchmark Program ***
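
For comparison, the tick arithmetic done by `clock` and `inittime` above can be written in C++ roughly as follows. This is a minimal sketch with made-up helper names; it takes the two 32-bit halves as plain integers instead of a real Windows `FILETIME`, so it compiles anywhere.

```cpp
#include <cassert>
#include <cstdint>

// Combines the low and high 32-bit halves of a FILETIME-style value
// into one 64-bit count of 100-nanosecond units.
std::uint64_t filetime_to_ticks(std::uint32_t low, std::uint32_t high) {
    return (static_cast<std::uint64_t>(high) << 32) | low;
}

// Elapsed milliseconds between two tick counts:
// 10,000 units of 100 ns make one millisecond.
std::uint64_t elapsed_ms(std::uint64_t start_ticks, std::uint64_t now_ticks) {
    return (now_ticks - start_ticks) / 10000u;
}
```

The unit is 100 ns, but as far as I know the system time behind it is usually only updated every several milliseconds, so long runs like the 10 second total used here are the safer way to use this timer.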

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 4:45:49 PM
Hmm ok,

I see I forgot to initialize the buffer in the delphi code/benchmark.

So here is the code for that:

BufferSize := ...;

// paste it here
for vIndex := 0 to 2048-1 do
begin
  vBuffer[vIndex] := Random(256);
end;

It doesn't matter much though.

This makes the Delphi implementation 1 megabyte/sec slower for a total speed
of 44 megabyte/sec.

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 5:12:48 PM
Hello,

I just compiled exactly the same project with the Free Pascal cross compiler,
Beta 2.1.4, for the X64/AMD64 instruction set.

I think I compiled it just fine, since it only compiles to X64 if I am not
mistaken.

I even turned on optimizations with -O3.

Look at the fricking results:

Y:\FreePascal\Benchmarks\BenchmarkTigerHash\Version001>C:\Tools\Compilers\FreePascal\2.1.4BetaForWin64\bin\i386-win32\ppcrossx64 -Mdelphi -O3 Project1.dpr

Y:\FreePascal\Benchmarks\BenchmarkTigerHash\Version001>Project1
MiB/Sec: 34.64

When compiling/linking/building to X64 it becomes even slower ?!

At least 10 megabyte/sec slower !

Absolutely SUX.

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 5:19:38 PM
Little conversion error:

At this line:

> function clock : int64;
> var
> current_tics : int64;
> ct : FILETIME;
> begin
>
> GetSystemTimeAsFileTime( &ct );

Should be:

GetSystemTimeAsFileTime( ct );

'&' = illegal character(?), but Delphi still compiled just fine with it
hehehehehe LOL crappy bastard=compiler :)

Bye,
Skybuck.


Skybuck Flying

Mar 14, 2008, 6:16:04 PM
Well, it's legal in Delphi; it's a new feature for declaring and working with
otherwise disallowed/in-use names (reserved words) like:

var
&Begin : string;

begin

&Begin := 'Blablablabla';

end;

This will now compile in Delphi, but not in free pascal ;)

Bye,
Skybuck.

