http://homepage1.nifty.com/herumi/soft/xbyak_e.html
For example:
---
#include <stdio.h>
#include "xbyak.h"
struct AddFunc : public Xbyak::CodeGenerator {
AddFunc(int y)
{
mov(eax, ptr[esp+4]);
add(eax, y);
ret();
}
};
int main()
{
AddFunc a(3);
int (*add3)(int) = (int (*)(int))a.getCode();
printf("3 + 2 = %d\n", add3(2));
}
---
In the above sample, add(eax, y) generates "add eax, 3" at run time,
since y = 3 when the program runs.
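To make the point concrete, here is a hand-assembled sketch (not using
Xbyak itself, and the exact bytes Xbyak emits may differ, e.g. it may
pick an imm32 form) of the 32-bit code AddFunc(y) corresponds to, showing
that y is baked in as an immediate at generation time:

```cpp
#include <vector>
#include <cstdint>

// Hand-assembled 32-bit code equivalent to AddFunc(y):
//   8B 44 24 04   mov eax, [esp+4]
//   83 C0 yy      add eax, yy   ; yy = runtime value of y, baked in
//   C3            ret
std::vector<uint8_t> make_add_func(int8_t y) {
    return { 0x8B, 0x44, 0x24, 0x04,
             0x83, 0xC0, static_cast<uint8_t>(y),
             0xC3 };
}
```

For AddFunc(3), byte 6 of the buffer is the immediate 3 itself.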
Another sample is fast quantization for JPEG
(http://homepage1.nifty.com/herumi/soft/xbyak/quantize.cpp).
It generates a fast division routine at run time.
Please try Xbyak if you are interested.
Thank you,
herumi
Nice. Do you properly support no-execute/DEP, or will it page fault if
DEP is enabled in Windows?
Alex
Yes. Xbyak calls VirtualProtect (on Windows) or mprotect (on Linux),
so it runs correctly even if DEP is enabled.
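The general idea can be sketched like this (my own minimal example, not
Xbyak's actual implementation; assumes x86-64 Linux): allocate a page as
writable, copy the generated code in, then use mprotect to make it
read+execute so the call succeeds even with NX/DEP active.

```cpp
#include <sys/mman.h>
#include <cstring>

// Minimal JIT-page sketch for x86-64 Linux (not Xbyak's code).
// Returns 42 on success, -1 on failure.
int run_jit_demo() {
    // machine code for: mov eax, 42 ; ret
    const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
    void* p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    std::memcpy(p, code, sizeof(code));
    // W^X: drop write permission before executing, so NX is satisfied
    if (mprotect(p, 4096, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, 4096);
        return -1;
    }
    int r = reinterpret_cast<int (*)()>(p)();
    munmap(p, 4096);
    return r;
}
```

Without the mprotect call, executing the page would fault on any system
that enforces no-execute on data pages.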
Nice. A few of my paused hobby projects have been waiting
for a tool like this - you may have resurrected them! (If
I can remember what they were!)
I notice that for code alignment you use a sequence of
individual nops. Might the following be useful?
(You might need to ensure that they're only enabled for
the right generations of processors, of course.)
xbyak/xbyak_mnemonic.h:
void nop2() { db(0x66); db(0x90); }
void nop3() { db(0x66); db(0x66); db(0x90); }
xbyak/xbyak.h:
void align(int x = 8)
{
    if (x != 4 && x != 8 && x != 16 && x != 32) throw ERR_BAD_ALIGN;
    int d;
    while ((d = x - (GetPtrDist(getCurr()) % x)) != x) {
        if (d >= 3) { nop3(); }
        else if (d == 2) { nop2(); }
        else { nop(); }
    }
}
Of course, there are alternative multi-byte nops which involve
register read/don't-modify/write operations. Perhaps they
could be used for longer sequences or for older architectures.
On that note - does anyone know at what point an unconditional
short jmp forward becomes faster than a sequence of NOPs for
various popular arch's?
Keep up the great work, and please keep us updated here on clax
when there are important revisions.
Phil
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration
>I notice that for code alignment you use a sequence of
>individual nops. Might the following be useful?
Thank you for your advice.
I tried adding optimized NOPs when I first implemented align().
But the best way to optimize NOPs differs by CPU type, so the CPU type
has to be detected first.
Though I could write that code, I don't think the function belongs in
xbyak.h. I intend to create xbyak_util.h (for example) and add the
function there.
cf.
"Software Optimization Guide for AMD64 Processors"
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
4.12 Code Padding with Operand-Size Override and NOP
90
66 90
66 66 90
66 66 66 90
66 66 90 66 90
"Intel 64 and IA-32 Architectures Optimization Reference Manual"
3.5.1.8 Using NOPs
http://download.intel.com/design/PentiumII/manuals/24512701.pdf
90 ; xchg eax, eax
89 C0 ; mov eax, eax
8D 40 00 ; lea eax, [eax + 0x00]
90 8D 40 00 ; nop / lea eax, [eax + 0x00]
...
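The AMD table above can be turned into a small padding helper; this is
my own sketch (not part of Xbyak, with a std::vector standing in for
db()) that emits runs of 0x66 operand-size prefixes in front of 0x90:

```cpp
#include <vector>
#include <cstdint>

// Pad with AMD-style "66 .. 66 90" sequences, per the table quoted
// above. The guide's longest single entry is "66 66 66 90" (4 bytes);
// longer pads chain such sequences (e.g. 5 -> "66 66 66 90" + "90",
// equivalent in size to the guide's "66 66 90 66 90").
void pad_amd(std::vector<uint8_t>& buf, int n) {
    while (n > 0) {
        int k = n > 4 ? 4 : n;                    // chunk of at most 4 bytes
        for (int i = 0; i < k - 1; i++) buf.push_back(0x66);
        buf.push_back(0x90);                      // the NOP itself
        n -= k;
    }
}
```

Each chunk is a single prefixed NOP instruction, so a short pad costs
one decode slot instead of several.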
---
herumi
These should work, but some of them are multiple instructions.
> "Intel 64 and IA-32 Architectures Optimization Reference Manual"
> 3.5.1.8 Using NOPs
> http://download.intel.com/design/PentiumII/manuals/24512701.pdf
>
> 90 ; xchg, eax, eax
This is a true NOP.
> 89 C0 ; mov eax, eax
^^^ this one isn't a true NOP in 64-bit mode because of extension to
64 bits.
> 8D 40 00 ; lea eax, [eax + 0x00]
Nor is this one for the same reason.
As far as I can tell by looking at both Intel and AMD processor
manuals, 0F 1F + ModRM is a common true multi-byte NOP. Its
availability depends on CPUID, though.
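For reference, the 0F 1F family has recommended encodings from 3 up to
9 bytes (both vendors' optimization manuals list essentially the same
table). A sketch of a padder built on it, again with a std::vector
standing in for an emit buffer:

```cpp
#include <vector>
#include <cstdint>

// Emit n bytes of padding using the recommended multi-byte NOP table:
// lengths 1-2 use 90 / 66 90, lengths 3-9 use the 0F 1F (nop r/m32)
// forms. Availability of 0F 1F is gated on CPUID, as noted above.
void emit_long_nop(std::vector<uint8_t>& buf, int n) {
    static const uint8_t tbl[9][9] = {
        {0x90},                                               // nop
        {0x66,0x90},                                          // 66 nop
        {0x0F,0x1F,0x00},                                     // nop [eax]
        {0x0F,0x1F,0x40,0x00},                                // nop [eax+0]
        {0x0F,0x1F,0x44,0x00,0x00},                           // nop [eax+eax+0]
        {0x66,0x0F,0x1F,0x44,0x00,0x00},
        {0x0F,0x1F,0x80,0x00,0x00,0x00,0x00},                 // nop [eax+0] (disp32)
        {0x0F,0x1F,0x84,0x00,0x00,0x00,0x00,0x00},
        {0x66,0x0F,0x1F,0x84,0x00,0x00,0x00,0x00,0x00},
    };
    while (n > 0) {
        int k = n > 9 ? 9 : n;   // longest single recommended NOP is 9 bytes
        buf.insert(buf.end(), tbl[k-1], tbl[k-1] + k);
        n -= k;
    }
}
```

Every chunk is one instruction, so even a long pad costs only a couple
of decode slots rather than one per byte.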
Alex
beside [...]
> On that note - does anyone know at what point an unconditional
> short jmp forward becomes faster than a sequence of NOPs for
> various popular arch's?
It depends on the CPU's code-prefetch size and decoder capability.
AMD K7/K8 skips up to three NOPs in zero time and runs up to 15 NOPs
in one cycle if they fit within one code fetch (after any jmp/call/...)
and that fetch doesn't cross a cache boundary.
An 'EB'/'E9' jump may take just one cycle or less, but at the jump
target we then have to add a full fetch cycle including pre-decode,
which means 2..5 clock cycles depending on the type, size, and count
of the instructions found there.
During code evaluation, I always have NOPs at the trailing end
of cache lines, and it doesn't affect timing if I jump over half
a line (32 NOPs), but it speeds things up if I jump over a full
cache line (usually to the next 64-byte boundary).
I don't know how Intel CPUs behave here.
__
wolfgang