http://homepage1.nifty.com/herumi/soft/xbyak_e.html
For example:
---
#include <stdio.h>
#include "xbyak.h"
struct AddFunc : public Xbyak::CodeGenerator {
AddFunc(int y)
{
mov(eax, ptr[esp+4]);
add(eax, y);
ret();
}
};
int main()
{
AddFunc a(3);
int (*add3)(int) = (int (*)(int))a.getCode();
printf("3 + 2 = %d\n", add3(2));
}
---
In the above sample, add(eax, y) generates "add eax, 3" at run time,
since y = 3 when the program runs.
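To make the point concrete, here is a hand-assembled sketch (not using
Xbyak itself, and the exact bytes Xbyak emits may differ, e.g. it may
pick an imm32 form) of the 32-bit code AddFunc(y) corresponds to, showing
that y is baked in as an immediate at generation time:

```cpp
#include <vector>
#include <cstdint>

// Hand-assembled 32-bit code equivalent to AddFunc(y):
//   8B 44 24 04   mov eax, [esp+4]
//   83 C0 yy      add eax, yy   ; yy = runtime value of y, baked in
//   C3            ret
std::vector<uint8_t> make_add_func(int8_t y) {
    return { 0x8B, 0x44, 0x24, 0x04,
             0x83, 0xC0, static_cast<uint8_t>(y),
             0xC3 };
}
```

For AddFunc(3), byte 6 of the buffer is the immediate 3 itself.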
Another sample is fast quantization for JPEG
(http://homepage1.nifty.com/herumi/soft/xbyak/quantize.cpp).
It generates a fast division routine at run time.
Please try Xbyak if you are interested.
Thank you,
herumi
Nice. Do you properly support no-execute/DEP, or will it page fault if
DEP is enabled in Windows?
Alex
Yes. Xbyak calls VirtualProtect (on Windows) or mprotect (on Linux),
so it runs correctly even if DEP is enabled.
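The general idea can be sketched like this (my own minimal example, not
Xbyak's actual implementation; assumes x86-64 Linux): allocate a page as
writable, copy the generated code in, then use mprotect to make it
read+execute so the call succeeds even with NX/DEP active.

```cpp
#include <sys/mman.h>
#include <cstring>

// Minimal JIT-page sketch for x86-64 Linux (not Xbyak's code).
// Returns 42 on success, -1 on failure.
int run_jit_demo() {
    // machine code for: mov eax, 42 ; ret
    const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    std::memcpy(p, code, sizeof(code));
    // W^X: drop write permission before executing, so NX is satisfied
    if (mprotect(p, 4096, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, 4096);
        return -1;
    }
    int r = reinterpret_cast<int (*)()>(p)();
    munmap(p, 4096);
    return r;
}
```

Without the mprotect call, executing the page would fault on any system
that enforces no-execute on data pages.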
Nice. A few of my paused hobby projects have been waiting
for a tool like this - you may have resurrected them! (If
I can remember what they were!)
I notice that for code alignment you use a sequence of
individual nops. Might the following be useful?
(You might need to ensure that they're only enabled for
the right generations of processors, of course.)
xbyak/xbyak_mnemonic.h:
void nop2() { db(0x66); db(0x90); }
void nop3() { db(0x66); db(0x66); db(0x90); }
xbyak/xbyak.h:
void align(int x = 8)
{
    if (x != 4 && x != 8 && x != 16 && x != 32) throw ERR_BAD_ALIGN;
    int d;
    while ((d = x - (GetPtrDist(getCurr()) % x)) != x) {
        if (d >= 3) { nop3(); }
        else if (d == 2) { nop2(); }
        else { nop(); }
    }
}
Of course, there are alternative multi-byte nops which involve
register read/don't-modify/write operations. Perhaps they
could be used for longer sequences or for older architectures.
On that note - does anyone know at what point an unconditional
short jmp forward becomes faster than a sequence of NOPs for
various popular arch's?
Keep up the great work, and please keep us updated here on clax
when there are important revisions.
Phil
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration
>I notice that for code alignment you use a sequence of
>individual nops. Might the following be useful?
Thank you for your advice.
I tried adding optimized NOPs when I first implemented align().
But the best way to optimize NOPs differs by CPU type, so the CPU type
has to be detected first.
Though I could write that code, I don't think the function belongs in
xbyak.h. I intend to create xbyak_util.h (for example) and add the
function there.
cf.
"Software Optimization Guide for AMD64 Processors"
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
4.12 Code Padding with Operand-Size Override and NOP
90
66 90
66 66 90
66 66 66 90
66 66 90 66 90
"Intel 64 and IA-32 Architectures Optimization Reference Manual"
3.5.1.8 Using NOPs
http://download.intel.com/design/PentiumII/manuals/24512701.pdf
90 ; xchg eax, eax
89 C0 ; mov eax, eax
8D 40 00 ; lea eax, [eax + 0x00]
90 8D 40 00 ; nop / lea eax, [eax + 0x00]
...
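The AMD table above can be turned into a small padding helper; this is
my own sketch (not part of Xbyak, with a std::vector standing in for
db()) that emits runs of 0x66 operand-size prefixes in front of 0x90:

```cpp
#include <vector>
#include <cstdint>

// Pad with AMD-style "66 .. 66 90" sequences, per the table quoted
// above. The guide's longest single entry is "66 66 66 90" (4 bytes);
// longer pads chain such sequences (e.g. 5 -> "66 66 66 90" + "90",
// equivalent in size to the guide's "66 66 90 66 90").
void pad_amd(std::vector<uint8_t>& buf, int n) {
    while (n > 0) {
        int k = n > 4 ? 4 : n;                    // chunk of at most 4 bytes
        for (int i = 0; i < k - 1; i++) buf.push_back(0x66);
        buf.push_back(0x90);                      // the NOP itself
        n -= k;
    }
}
```

Each chunk is a single prefixed NOP instruction, so a short pad costs
one decode slot instead of several.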
---
herumi
These should work, but some of them are multiple instructions.
> "Intel 64 and IA-32 Architectures Optimization Reference Manual"
> 3.5.1.8 Using NOPs
> http://download.intel.com/design/PentiumII/manuals/24512701.pdf
>
> 90 ; xchg, eax, eax
This is a true NOP.
> 89 C0 ; mov eax, eax
^^^ this one isn't a true NOP in 64-bit mode because of extension to
64 bits.
> 8D 40 00 ; lea eax, [eax + 0x00]
Nor is this one for the same reason.
As far as I can tell by looking at both Intel and AMD processor
manuals, 0F 1F + ModRM is a common true multi-byte NOP. Its
availability depends on CPUID, though.
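For reference, the 0F 1F family has recommended encodings from 3 up to
9 bytes (both vendors' optimization manuals list essentially the same
table). A sketch of a padder built on it, again with a std::vector
standing in for an emit buffer:

```cpp
#include <vector>
#include <cstdint>

// Emit n bytes of padding using the recommended multi-byte NOP table:
// lengths 1-2 use 90 / 66 90, lengths 3-9 use the 0F 1F (nop r/m32)
// forms. Availability of 0F 1F is gated on CPUID, as noted above.
void emit_long_nop(std::vector<uint8_t>& buf, int n) {
    static const uint8_t tbl[9][9] = {
        {0x90},                                               // nop
        {0x66,0x90},                                          // 66 nop
        {0x0F,0x1F,0x00},                                     // nop [eax]
        {0x0F,0x1F,0x40,0x00},                                // nop [eax+0]
        {0x0F,0x1F,0x44,0x00,0x00},                           // nop [eax+eax+0]
        {0x66,0x0F,0x1F,0x44,0x00,0x00},
        {0x0F,0x1F,0x80,0x00,0x00,0x00,0x00},                 // nop [eax+0] (disp32)
        {0x0F,0x1F,0x84,0x00,0x00,0x00,0x00,0x00},
        {0x66,0x0F,0x1F,0x84,0x00,0x00,0x00,0x00,0x00},
    };
    while (n > 0) {
        int k = n > 9 ? 9 : n;   // longest single recommended NOP is 9 bytes
        buf.insert(buf.end(), tbl[k-1], tbl[k-1] + k);
        n -= k;
    }
}
```

Every chunk is one instruction, so even a long pad costs only a couple
of decode slots rather than one per byte.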
Alex
beside [...]
> On that note - does anyone know at what point an unconditional
> short jmp forward becomes faster than a sequence of NOPs for
> various popular arch's?
It depends on the CPU's code-prefetch size and decoder capability.
AMD K7/K8 skips up to three NOPs in zero time and runs up to 15 NOPs
in one cycle if they fit within one code fetch (after any jmp/call/...)
and that fetch doesn't cross a cache boundary.
An 'EB'/'E9' jump may take just one cycle or less, but at the jump
target we then have to add a full fetch cycle including pre-decode,
which means 2..5 clock cycles depending on the type, size, and count
of the instructions found there.
During code evaluation, I always have NOPs at the trailing end
of cache lines, and it doesn't affect timing if I jump over half
a line (32 NOPs), but it speeds things up if I jump over a full
cache line (usually to the next 64-byte boundary).
I don't know how Intel CPUs behave here.
__
wolfgang