
assembler speed...


cr88192

Mar 25, 2010, 11:18:15 AM
well, this was a recent argument on comp.compilers, but I figured it may
make some sense in a "freer" context.

basically, it is the question of whether or not a textual assembler is fast
enough for use in a JIT context (I believe it is, and that one can benefit
notably from using textual ASM here).

in my case, it works ok, but then I realized that I push relatively low
volumes of ASM through it, and I am left to wonder about the higher-volume
cases.


so, some tests:
basically, I have tried assembling a chunk of text over and over again (in a
loop) and figuring out how quickly it was pushing through ASM.

it keeps track of the time and of how many times the loop has run, runs for
about 10s, and from this can figure how quickly ASM is being processed.
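a rough sketch of this kind of timing loop, in C (the names here are
illustrative, with a trivial stand-in for the real assembler, not the actual
code being benchmarked):

```c
#include <time.h>
#include <stddef.h>

/* Repeatedly run 'assemble' on the same source buffer for roughly
   'seconds' of CPU time, and return the throughput as bytes of ASM
   text consumed per second. */
double measure_throughput(size_t (*assemble)(const char *src, size_t len),
                          const char *src, size_t len, double seconds)
{
    clock_t start = clock();
    double elapsed;
    long iterations = 0;

    do {
        assemble(src, len);
        iterations++;
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < seconds);

    /* elapsed >= seconds > 0 here, so the division is safe */
    return (double)iterations * (double)len / elapsed;
}

/* A trivial stand-in "assembler" used for demonstration: it just
   scans the buffer, so the loop has something to measure. */
size_t demo_assemble(const char *src, size_t len)
{
    size_t i, x = 0;
    for (i = 0; i < len; i++)
        x += (unsigned char)src[i];
    return x;
}
```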

the dynamic linker is currently disabled in these tests, as this part proves
problematic to benchmark due to technical reasons (endlessly re-linking the
same code into the running image doesn't turn out well, as fairly quickly
the thing will crash).

(I would need to figure out a way to hack-disable part of the dynamic linker
to use it in benchmarks).

initially, I found that my assembler was not performing terribly well, and
the profiler showed that most of the time was going into zeroing memory. I
fixed this, partly by reducing the size of some buffers, and partly by
disabling the 'memset' calls in a few cases.

then I went on a search, trying to micro-optimize the preprocessor, and
also finding and fixing a few bugs (resulting from a few recent additions to
the preprocessor functionality).

at this point, it was pulling off around 1MB/s (so, 1MB of ASM per second).

I then noted that most of the time was going into my case-insensitive
compare function, which is a bit slower than the case-sensitive compare
function (strcmp).

doing a little fiddling in the ASM parser reduced its weight, and got the
speed to about 1.5MB/s.

as such, time is still mostly used by the case-insensitive compare, and also
the function to read tokens.
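FWIW, one common way to cheapen a case-insensitive compare is a table-driven
version, which avoids per-character tolower() calls; a generic sketch in C
(the function names are mine, not from the assembler being discussed):

```c
#include <limits.h>

/* One-time table mapping each byte to its lowercase form; avoids
   calling tolower() (and any locale machinery) per character. */
static unsigned char lower_tab[UCHAR_MAX + 1];

void init_lower_tab(void)
{
    int i;
    for (i = 0; i <= UCHAR_MAX; i++)
        lower_tab[i] = (i >= 'A' && i <= 'Z')
                     ? (unsigned char)(i + 32)
                     : (unsigned char)i;
}

/* Case-insensitive compare with a strcmp-style return value. */
int ci_strcmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    while (lower_tab[*pa] == lower_tab[*pb]) {
        if (*pa == '\0')
            return 0;
        pa++; pb++;
    }
    return (int)lower_tab[*pa] - (int)lower_tab[*pb];
}
```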


I am left to wonder if this is "fast enough".

I am left to wonder if I should add options for a no-preprocessor +
case-sensitive mode (opcodes/registers/... would be necessarily lower-case),
...

but, really, I don't know how fast people feel is needed.


but, in my case, I will still use it, since the other major options:
having codegens hand-craft raw machine-code;
having to create and use an API to emit opcodes;
...

don't really seem all that great either.

and, as well, I guess the volumes of ASM I assemble are low enough that it
has not been much of an issue thus far (I tend not to endlessly re-assemble
all of my libraries, as most loadable modules are in HLL's, and binary
object-caching tends to be used instead of endless recompilation...).

for most fragmentary code, such as resulting from eval or from
special-purpose thunks, the total volume of ASM tends to remain fairly low
(most are periodic and not that large).

likely, if it did become that much of an issue, there would be bigger issues
at play...

or such...

Marco van de Voort

Mar 25, 2010, 12:05:33 PM
On 2010-03-25, cr88192 <cr8...@hotmail.com> wrote:
> well, this was a recent argument on comp.compilers, but I figured it may
> make some sense in a "freer" context.
>
> basically, it is the question of whether or not a textual assembler is fast
> enough for use in a JIT context (I believe it is, and that one can benefit
> notably from using textual ASM here).

When we replaced the external assembler with an internal one, overall build
time was 40% faster on Linux/FreeBSD, and more than 100% faster on Windows.

We attributed the difference to slower I/O and, mainly, slower .exe
startup/shutdown time.

Branimir Maksimovic

Mar 25, 2010, 1:34:41 PM
On Thu, 25 Mar 2010 08:18:15 -0700
"cr88192" <cr8...@hotmail.com> wrote:

>
> initially, I found that my assembler was not performing terribly
> well, and the profiler showed that most of the time was going into
> zeroing memory. I fixed this, partly both by reducing the size of
> some buffers, and in a few cases disabling the 'memset' calls.

This is fasm's time compiling its own source for all 4
platforms it supports, on my machine:

bmaxa@maxa:~/fasm/source$ time fasm DOS/fasm.asm | fasm Linux/fasm.asm
| fasm libc/fasm.asm | fasm Win32/fasm.asm
flat assembler version 1.68 (16384 kilobytes memory) 4 passes, 83456
bytes.

real 0m0.060s
user 0m0.080s
sys 0m0.000s
bmaxa@maxa:~/fasm/source$ find . -name 'fasm*' -exec ls -l {}
\;-rw-r--r-- 1 bmaxa bmaxa 99982 2010-03-25 18:21 ./libc/fasm.o
-rw-rw-r-- 1 bmaxa bmaxa 4874 2009-07-06 15:44 ./libc/fasm.asm
-rw-r--r-- 1 bmaxa bmaxa 77635 2010-03-25 18:21 ./DOS/fasm.exe
-rw-rw-r-- 1 bmaxa bmaxa 5260 2009-07-06 15:44 ./DOS/fasm.asm
-rw-r--r-- 1 bmaxa bmaxa 83456 2010-03-25 18:21 ./Win32/fasm.exe
-rw-rw-r-- 1 bmaxa bmaxa 6160 2009-07-06 15:44 ./Win32/fasm.asm
-rwxr-xr-x 1 bmaxa bmaxa 75331 2010-03-25 18:21 ./Linux/fasm -rw-rw-r--
1 bmaxa bmaxa 4694 2009-07-06 15:44 ./Linux/fasm.asm
bmaxa@maxa:~/fasm/source$
bmaxa@maxa:~/fasm/source$ find . -name '*.inc' -exec ls -l {} \;
-rw-rw-r-- 1 bmaxa bmaxa 5424 2009-07-06 15:44 ./libc/system.inc
-rw-rw-r-- 1 bmaxa bmaxa 50682 2009-07-06 15:44 ./expressi.inc
-rw-rw-r-- 1 bmaxa bmaxa 138351 2009-07-06 15:44 ./x86_64.inc
-rw-rw-r-- 1 bmaxa bmaxa 7779 2009-07-06 15:44 ./DOS/system.inc
-rw-rw-r-- 1 bmaxa bmaxa 1995 2009-07-06 15:44 ./DOS/sysdpmi.inc
-rw-rw-r-- 1 bmaxa bmaxa 10419 2009-07-06 15:44 ./DOS/modes.inc
-rw-rw-r-- 1 bmaxa bmaxa 24541 2009-07-06 15:44 ./parser.inc
-rw-rw-r-- 1 bmaxa bmaxa 37936 2009-07-06 15:44 ./assemble.inc
-rw-rw-r-- 1 bmaxa bmaxa 7916 2009-07-06 15:44 ./Win32/system.inc
-rw-rw-r-- 1 bmaxa bmaxa 6290 2009-07-06 15:44 ./Linux/system.inc
-rw-rw-r-- 1 bmaxa bmaxa 46363 2009-07-06 15:44 ./preproce.inc
-rw-rw-r-- 1 bmaxa bmaxa 3860 2009-07-06 15:44 ./errors.inc
-rw-rw-r-- 1 bmaxa bmaxa 1805 2009-07-06 15:44 ./version.inc
-rw-rw-r-- 1 bmaxa bmaxa 82747 2009-07-06 15:44 ./formats.inc
-rw-rw-r-- 1 bmaxa bmaxa 2404 2009-07-06 15:44 ./messages.inc
-rw-rw-r-- 1 bmaxa bmaxa 48970 2009-07-06 15:44 ./tables.inc
-rw-rw-r-- 1 bmaxa bmaxa 2267 2009-07-06 15:44 ./variable.inc


Greets!


--
http://maxa.homedns.org/

Sometimes online sometimes not


Robbert Haarman

Mar 25, 2010, 2:24:59 PM
Hi CR,

On Thu, Mar 25, 2010 at 08:18:15AM -0700, cr88192 wrote:
>
> basically, it is the question of whether or not a textual assembler is fast
> enough for use in a JIT context (I believe it is, and that one can benefit
> notably from using textual ASM here).

I would imagine that it depends on what you consider "fast enough".

> in my case, it works ok, but then I realized that I push through it
> relatively low volumes of ASM, and I am left to wonder about the higher
> volume cases.

Right. If you handle only low volumes, many solutions tend to be
"fast enough".

In my experience, assemblers (that read assembly code and produce
machine code) tend to be quite fast. It seems to me that many compilers
spend more time processing the source language into (optimized) assembly
than the assembler spends turning the resulting assembly code into
machine code.

On the other hand, parsing text can be quite time consuming. In programs
I have profiled, it is not uncommon to find that they spend most of their
time parsing their input. Although I haven't profiled any assemblers, I
could easily imagine that parsing and recognizing opcodes takes up most
of their time.

To answer all the questions here, it would probably be a good idea to
first come up with a definition of "fast enough", and then, if you find
your program isn't fast enough by this definition, to profile it to figure
out where it is spending most of its time.

Another question is why you would be going through assembly code at all.
What benefit does it provide, compared to, for example, generating machine
code directly? Surely, if speed is a concern, you could benefit from
cutting out the assembler altogether.

Kind regards,

Bob

Maxim S. Shatskih

Mar 25, 2010, 3:15:26 PM
> basically, it is the question of whether or not a textual assembler is fast
> enough for use in a JIT context

What is the value of this?

The values of JIT:

a) platform independent binaries, the platform-dependency occurs only on load and not on build.
b) mandatory, really mandatory, without the chances to escape by the malicious tools use, things like exception handling, attribute-based code access rights, and garbage collection.

Both a) and b) are only achievable if some IL will be used pre-JIT, not the real assembler.

IL is a) platform-independent b) just has no means to bypass security, exception frames, or to do leakable memory allocations.

Real ASM is not such.

You can, though, invent the textual IL and binarize it on load. But what is the value of this, compared to IL binarized at build?

--
Maxim S. Shatskih
Windows DDK MVP
ma...@storagecraft.com
http://www.storagecraft.com

cr88192

Mar 25, 2010, 3:22:02 PM

"Robbert Haarman" <comp.la...@inglorion.net> wrote in message
news:2010032518...@yoda.inglorion.net...

> Hi CR,
>
> On Thu, Mar 25, 2010 at 08:18:15AM -0700, cr88192 wrote:
>>
>> basically, it is the question of whether or not a textual assembler is
>> fast
>> enough for use in a JIT context (I believe it is, and that one can
>> benefit
>> notably from using textual ASM here).
>
> I would imagine that it depends on what you consider "fast enough".

agreed.

it has been fast enough for my uses, but others claim that JIT requires
things like directly crafting the sequences in the codegen.

however, an assembler has many advantages:
it avoids the tedium of endlessly re-crafting the same basic opcodes;
it can automatically figure out how wide jumps need to be;
...


>> in my case, it works ok, but then I realized that I push through it
>> relatively low volumes of ASM, and I am left to wonder about the higher
>> volume cases.
>
> Right. If you handle only low volumes, many solutions tend to be
> "fast enough".
>
> In my experience, assemblers (that read assembly code and produce
> machine code) tend to be quite fast. It seems to me that many compilers
> spend more time processing the source language into (optimized) assembly
> than the assembler spends turning the resulting assembly code into
> machine code.
>

agreed.

the vast majority of the time in my compiler tends to go into higher-level
operations:
parsing C code;
working with AST's;
processing the IL and running the codegen;
...

often to produce only a few kB to be run through the assembler at a time.


FWIW, my assembler is MUCH faster than my C upper-end, since it can assemble
about 1.5MB of ASM per second, rather than about 250kB per second
(which is a common case for the C frontend, given the volumes of crap it can
pull in from headers...).


> On the other hand, parsing text can be quite time consuming. In programs
> I have profiled, it is not uncommon to find that they spend most of their
> time parsing their input. Although I haven't profiled any assemblers, I
> could easily imagine that parsing and recognizing opcodes takes up most
> of their time.
>

opcode lookup uses a hash-table, and doesn't really show up in the profiler.
much more of the time at present goes into recognizing and reading off
tokens.
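as a rough illustration of that sort of scheme (a chained hash table over
mnemonics so that, ideally, only one string compare is needed per lookup;
this is a generic sketch, not the actual code in question):

```c
#include <string.h>

#define OPC_HASH_SIZE 256

struct opcode_entry {
    const char *name;            /* mnemonic text */
    int opnum;                   /* whatever the assembler keys on */
    struct opcode_entry *next;   /* chain for hash collisions */
};

static struct opcode_entry *opc_table[OPC_HASH_SIZE];

static unsigned opc_hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h & (OPC_HASH_SIZE - 1);
}

void opc_register(struct opcode_entry *e)
{
    unsigned h = opc_hash(e->name);
    e->next = opc_table[h];
    opc_table[h] = e;
}

struct opcode_entry *opc_lookup(const char *name)
{
    struct opcode_entry *e = opc_table[opc_hash(name)];
    /* at most one strcmp per chain entry; a good hash keeps chains short */
    for (; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return 0;
}

/* example registration of a couple of mnemonics, for demonstration */
static struct opcode_entry ent_mov = { "mov", 0x88, 0 };
static struct opcode_entry ent_add = { "add", 0x00, 0 };

void opc_init_demo(void)
{
    opc_register(&ent_mov);
    opc_register(&ent_add);
}
```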


> To answer all the questions here, it would probably be a good idea to
> first come up with a definition of "fast enough", and then, if you find
> your program isn't fast enough by this definition, to profile it to figure
> out where it is spending most of its time.

well, I know about my programs.
the question is, what about everyone else?...

so, the goal is more of a general answer, rather than something simply
relevant to my projects...


> Another question is why you would be going through assembly code at all.
> What benefit does it provide, compared to, for example, generating machine
> code directly? Surely, if speed is a concern, you could benefit from
> cutting out the assembler altogether.

producing it directly is IMO a much less nice option.
it is barely even a workable strategy with typical bytecode formats, and
with x86 machine code would probably suck...


admittedly, if really needed I could add a binary-ASM API to my assembler
(would allow using function calls to generate ASM), but this is likely to be
much less nice than using a textual interface, and could not likely optimize
jumps (likely short jumps would need to be explicit).

OTOH, another assembler could be written, but there is not likely a whole
lot of point for this at present.

...

cr88192

Mar 25, 2010, 3:38:44 PM

"Maxim S. Shatskih" <ma...@storagecraft.com.no.spam> wrote in message
news:hogcod$qkm$1...@news.mtu.ru...

> basically, it is the question of whether or not a textual assembler is
> fast
> enough for use in a JIT context

<--


What is the value of this?

The values of JIT:

a) platform independent binaries, the platform-dependency occurs only on
load and not on build.
b) mandatory, really mandatory, without the chances to escape by the
malicious tools use, things like exception handling, attribute-based code
access rights, and garbage collection.

Both a) and b) are only achievable if some IL will be used pre-JIT, not the
real assembler.

IL is a) platform-independent b) just has no means to bypass security,
exception frames, or to do leakable memory allocations.

Real ASM is not such.

You can, though, invent the textual IL and binarize it on load. But what is
the value of this, compared to IL binarized at build?
-->


in question here is using ASM within the JIT, IOW, in the post-IL stages.

for example, a person loads their bytecode, and has the option of how to go
about JIT'ing it.

admitted, for large binary-image IL formats (such as MSIL/CIL), using ASM
could have a performance impact.


IL is loaded, run through codegen, and then:
A. textual ASM is produced, and run through an assembler, and linked into
the running image;
B. machine code is produced in-place.

A. has the potential of a much cleaner implementation, ... but the cost that
it is not as fast and, on average, there is more code (both in the codegen,
+ the code for the assembler);

B. is very fast, and needs relatively little code, but IMO often leads to a
much less clean and much less reusable implementation (likely the codegen
will depend far more on architecture specific details than had it simply
used an assembler, since the assembler would manage many low-level ISA
details).

the few times I had used B (mostly because early on I couldn't use the
assembler from within the assembler), it was a very unpleasant experience,
typically involving going back and forth over the Intel docs to craft the
particular instruction sequences (and also for each CPU mode).

I have generally since used almost entirely dynamically-produced textual
ASM, since this is, FWIW, a much nicer way to do it.

similarly, with ASM it is also much more obvious which instructions are
being used, taking out the problem of having to recognize many of the
opcodes by byte sequence, ...

...

Maxim S. Shatskih

Mar 25, 2010, 3:52:48 PM
> IL is loaded, run through codegen, and then:
> A. textual ASM is produced, and run through an assembler, and linked into
> the running image;
> B. machine code is produced in-place.

So, ASM is the intermediate form of IL->machine code translator.

And why this form? maybe there are other ways, more effective?

> A. has the potential of a much cleaner implementation,

Matter of taste.

>... but the cost that it is not as fast and, on average, there is more code (both in the codegen,
> + the code for the assembler);

Perf loss, code complexity loss - all for the matter of taste.

Robbert Haarman

Mar 25, 2010, 4:19:11 PM

I wouldn't worry about that too much, unless your code generator is the
most interesting part of what you are making. First, make it work. Then
you can think about making it better - assuming you don't have more
interesting things to tackle.

> > Another question is why you would be going through assembly code at all.
> > What benefit does it provide, compared to, for example, generating machine
> > code directly? Surely, if speed is a concern, you could benefit from
> > cutting out the assembler altogether.
>
> producing it directly is IMO a much less nice option.
> it is barely even a workable strategy with typical bytecode formats, and
> with x86 machine code would probably suck...

I don't really see that. The way I see it, most of the work is in getting
from what you have (presumably some instruction-set-independent source code
or intermediate representation) to the instructions of your target platform.
Once you are there, I think emitting these instructions as binary or as
text doesn't make too much of a difference.

I've written code to emit binary instructions for various targets, and,
in my experience, it's not very hard. Sure, x86's ModRM is a bit tricky,
but you write that once and then it will just sit there, doing its job.
In the grand scheme of writing a compiler, this isn't a big deal.
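As a small illustration of the "write it once" point, the register-to-register
case of a ModRM emitter might look like this in C (a sketch under x86-32
assumptions; the function name and interface are mine, not from the thread):

```c
#include <stdint.h>
#include <stddef.h>

/* Emit opcode + ModRM for a register-to-register ALU instruction.
   'op' is the primary opcode byte; 'reg' and 'rm' are 32-bit register
   numbers 0..7 (eax=0, ecx=1, edx=2, ebx=3, ...).
   Returns the number of bytes written. */
size_t emit_alu_reg_reg(uint8_t *out, uint8_t op, int reg, int rm)
{
    out[0] = op;
    /* mod=11 selects register-direct addressing */
    out[1] = (uint8_t)(0xC0 | (reg << 3) | rm);
    return 2;
}
```

once something like this exists, the rest of the code generator never touches
ModRM bit-packing again.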

Generating the instructions in binary form right away also makes it very
easy to know exactly where your code ends up and what its size is, which
may actually make it _easier_ to patch addresses into your code and
make decisions about short vs. long jumps.

> admittedly, if really needed I could add a binary-ASM API to my assembler
> (would allow using function calls to generate ASM), but this is likely to be
> much less nice than using a textual interface, and could not likely optimize
> jumps (likely short jumps would need to be explicit).

My experience is that how nice the API is depends very much on the
language you express it in. For example, I've tried to come up with a nice
API for instruction generation in C, but never got it to the point where
I was really happy with it. In a language which lets you write out a
data structure in-line, preferably with automatic memory management and
namespaces, this is much easier.

It's the difference between, for example:

n += cg_x86_emit_reg32_imm8_instr(code + n,
                                  sizeof(code) - n,
                                  CG_X86_OP_OR,
                                  CG_X86_REG_EBX,
                                  42);

and

(emit code '(or (reg ebx) (imm 42)))

Cheers,

Bob


Rod Pemberton

Mar 25, 2010, 6:21:08 PM
"cr88192" <cr8...@hotmail.com> wrote in message
news:hofus4$a0r$1...@news.albasani.net...

> basically, it is the question of whether or not a textual assembler is
> fast
> enough for use in a JIT context (I believe it is, and that one can benefit
> notably from using textual ASM here).
>

Is TCC when used as TCCBOOT fast enough in a JIT context? ! ? ! ...

We know that interpreters are a bit slower than compilers, and compilers do
take some time too. How fast is fast enough is very relative to 1)
generation of microprocessor, 2) size of files, 3) in-memory or on-disk, 4)
language complexity, etc.

> so, some tests:
> basically, I have tried assembling a chunk of text over and over again (in
> a
> loop) and figuring out how quickly it was pushing through ASM.

You may just be testing the OS's buffering abilities here...

> initially, I found that my assembler was not performing terribly well, and
> the profiler showed that most of the time was going into zeroing memory. I
> fixed this, partly both by reducing the size of some buffers, and in a few
> cases disabling the 'memset' calls.

Instead of memset()-ing entire strings, you might try just setting the first
char to a nul character: str[0]='\0'; It's not as safe, but if your code
is without errors, it shouldn't be an issue.

Instead of strcmp(), you can try switches on single chars, while progressing
through the chars needed to obtain the required info. Sometimes this works
because you only need one or two characters out of much longer keywords to
distinguish it from other keywords.

Character directed parsing can speed things up too. Determining what the
syntax component is, say integer or keyword, takes time. But, if you put a
character in front that indicates what follows, you don't have
to do that processing to determine if it's an integer or keyword. E.g., an
example from an assembler of mine:

.eax _out $255

"dot" indicates a register follows. "underscore" indicates instruction
follows. "dollar-sign" indicates a decimal integer follows. Each directive
character is passed to switch() which selects the appropriate parsing
operation. The parser doesn't have to determine _what_ "eax" or "out" or
"255" is. It "knows" from the syntax. That's a large part of parsing logic
eliminated. When you program, you know what the directive character is and
can easily insert the correct character. Code generators also "know" too -
since you coded it... It's just an inconvenience to type the extra
characters, if you're doing alot of assembly.
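A minimal sketch of that dispatch, in C (the enum and function are invented
for illustration, not taken from the assembler described above):

```c
/* Token kinds for the sigil-directed syntax described above:
   '.' register, '_' instruction, '$' decimal integer. */
enum tok_kind { TOK_REG, TOK_INSN, TOK_INT, TOK_ERR };

/* Classify a token by its leading directive character alone, so the
   parser never has to work out what "eax" or "out" or "255" is. */
enum tok_kind classify(const char *tok)
{
    switch (tok[0]) {
    case '.': return TOK_REG;
    case '_': return TOK_INSN;
    case '$': return TOK_INT;
    default:  return TOK_ERR;
    }
}
```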

If you use memory instead of file I/O, processing will be faster. Linked
lists, esp. doubly linked, can also speed up in-memory processing.
Allocation of memory in a single large block, instead of calling malloc()
repeatedly "as you go" or as needed, can simplify the arrangement of objects
in the allocated memory. It can eliminate pointers. Reduce the object
size. etc.

> I then noted that most of the time was going into my case-insensitive
> compare function, which is a bit slower than the case-sensitive compare
> function (strcmp).

Decide on one case, such as lowercase. That cuts your processing in half.
Use hash functions. They can eliminate multiple strcmp()'s. Try to only
strcmp() once, to eliminate possible collisions.

> and, as well, I guess the volumes of ASM I assemble are low enough that it
> has not been much of an issue thus far (I tend not to endlessly
> re-assemble
> all of my libraries, as most loadable modules are in HLL's, and binary
> object-caching tends to be used instead of endless recompilation...).

If it's low use, you can eliminate much code by removing checks. I.e., if
you know your compiler correctly emits registers "eax", "ebx", etc., don't
implement a check for invalid registers. Some people would call such
techniques "unsafe" programming - which is true. But, since the code is
used in a controlled environment and without any "garbage" input, it'll
speed things up if the code does less work such as safety checks.


Rod Pemberton


BGB / cr88192

Mar 25, 2010, 8:22:35 PM

"Maxim S. Shatskih" <ma...@storagecraft.com.no.spam> wrote in message
news:hogeuf$rgr$1...@news.mtu.ru...

> IL is loaded, run through codegen, and then:
> A. textual ASM is produced, and run through an assembler, and linked into
> the running image;
> B. machine code is produced in-place.

<--


So, ASM is the intermediate form of IL->machine code translator.

And why this form? maybe there are other ways, more effective?

-->

ASM is the accepted way, at least in more traditional compilers.
I am ignoring GAS here though, as GAS's syntax is lame...


> A. has the potential of a much cleaner implementation,

<--
Matter of taste.
-->

granted, but thus far I have not seen any other particularly good
strategies.
sequential API calls to emit opcodes would be lame IMO.


>... but the cost that it is not as fast and, on average, there is more code
>(both in the codegen,
> + the code for the assembler);

<--


Perf loss, code complexity loss - all for the matter of taste.
-->

why not?...

one may also choose to use XML DOM nodes in place of more specialized
structures for internal compiler machinery, and they carry similar costs.

BGB / cr88192

Mar 25, 2010, 11:04:05 PM
[was responding to this earlier, but Windows blue-screened...].


"Rod Pemberton" <do_no...@havenone.cmm> wrote in message
news:hognj6$orq$1...@speranza.aioe.org...


> "cr88192" <cr8...@hotmail.com> wrote in message
> news:hofus4$a0r$1...@news.albasani.net...
>> basically, it is the question of whether or not a textual assembler is
>> fast
>> enough for use in a JIT context (I believe it is, and that one can
>> benefit
>> notably from using textual ASM here).
>>
>
> Is TCC when used as TCCBOOT fast enough in a JIT context? ! ? ! ...

can't say, I have not used tcc.
I hear it compiles fairly fast though.


> We know that interpreters are a bit slower than compilers, and compilers
> do
> take some time too. How fast is fast enough is very relative to 1)
> generation of microprocessor, 2) size of files, 3) in-memory or on-disk,
> 4)
> language complexity, etc.

there is no disk IO here...


>> so, some tests:
>> basically, I have tried assembling a chunk of text over and over again
>> (in
>> a
>> loop) and figuring out how quickly it was pushing through ASM.
>
> You may just be testing the OS's buffering abilities here...
>

no OS involvement, only memory buffers...


>> initially, I found that my assembler was not performing terribly well,
>> and
>> the profiler showed that most of the time was going into zeroing memory.
>> I
>> fixed this, partly both by reducing the size of some buffers, and in a
>> few
>> cases disabling the 'memset' calls.
>
> Instead of memset()-ing entire strings, you might try just setting the
> first
> char to a nul character: str[0]='\0'; It's not as safe, but if your code
> is without errors, it shouldn't be an issue.
>

the memory zeroing was mostly in my COFF writer, which initially used, and
zeroed, a fairly large temporary buffer.

I since both made the temp buffer smaller and disabled the memset, so this
is no longer an issue.


> Instead of strcmp(), you can try switches on single chars, while
> progressing
> through the chars needed to obtain the required info. Sometimes this
> works
> because you only need one or two characters out of much longer keywords to
> distinguish it from other keywords.
>

possible, however switches are in general a fairly awkward way to select
between tokens.


> Character directed parsing can speed things up too. Determining what the
> syntax component is, say integer or keyword, takes time. But, if you put
> a
> character infront that indicates what follows, you don't have
> to do that processing to determine if it's an integer or keyword. E.g.,
> an
> example from an assembler of mine:
>
> .eax _out $255
>
> "dot" indicates a register follows. "underscore" indicates instruction
> follows. "dollar-sign" indicates a decimal integer follows. Each
> directive
> character is passed to switch() which selects the appropriate parsing
> operation. The parser doesn't have to determine _what_ "eax" or "out" or
> "255" is. It "knows" from the syntax. That's a large part of parsing
> logic
> eliminated. When you program, you know what the directive character is
> and
> can easily insert the correct character. Code generators also "know"
> too -
> since you coded it... It's just an inconvenience to type the extra
> characters, if you're doing alot of assembly.

this, of course, would break NASM syntax compatibility (as well as break a
lot of the code already existing within my codebase).


> If you use memory instead of file I/O, processing will be faster. Linked
> lists, esp. doubly linked, can also speed up in-memory processing.
> Allocation of memory in a single large block, instead of calling malloc()
> repeatedly "as you go" or as needed, can simplify the arrangement of
> objects
> in the allocated memory. It can eliminate pointers. Reduce the object
> size. etc.

my assembler uses relatively few in-memory objects.
mostly, it is buffer operations...

no file IO is used here, as file-IO is teh-slow, and also makes little sense
for moving data from place-to-place within an app...


>> I then noted that most of the time was going into my case-insensitive
>> compare function, which is a bit slower than the case-sensitive compare
>> function (strcmp).
>
> Decide on one case, such as lowercase. That cuts your processing in half.
> Use hash functions. They can eliminate multiple strcmp()'s. Try to only
> strcmp() once, to eliminate possible collisions.

hashes are already used.
case-insensitive handling is used as my assembler was based some off of
NASM's syntax, and NASM is case-insensitive.

admitted, there are some differences between them, and adding case
sensitivity would be just another item to the list...


>> and, as well, I guess the volumes of ASM I assemble are low enough that
>> it
>> has not been much of an issue thus far (I tend not to endlessly
>> re-assemble
>> all of my libraries, as most loadable modules are in HLL's, and binary
>> object-caching tends to be used instead of endless recompilation...).
>
> If it's low use, you can eliminate much code by removing checks. I.e., if
> you know your compiler correctly emits registers "eax", "ebx", etc., don't
> implement a check for invalid registers. Some people would call such
> techniques "unsafe" programming - which is true. But, since the code is
> used in a controlled environment and without any "garbage" input, it'll
> speed things up if the code does less work such as safety checks.


my main codegen is only one place which emits ASM.

there are many other things which emit ASM, as it is currently the main
language in use for dynamically-generated code fragments.


no error checking code shows significantly on the profiler though, and most
of my optimization effort was profiler driven...


BGB / cr88192

Mar 25, 2010, 11:23:18 PM

"Robbert Haarman" <comp.la...@inglorion.net> wrote in message
news:2010032520...@yoda.inglorion.net...

> On Thu, Mar 25, 2010 at 12:22:02PM -0700, cr88192 wrote:

<snip>

>>
>> > To answer all the questions here, it would probably be a good idea to
>> > first come up with a definition of "fast enough", and then, if you find
>> > your program isn't fast enough by this definition, to profile it to
>> > figure
>> > out where it is spending most of its time.
>>
>> well, I know about my programs.
>> the question is, what about everyone else?...
>
> I wouldn't worry about that too much, unless your code generator is the
> most interesting part of what you are making. First, make it work. Then
> you can think about making it better - assuming you don't have more
> interesting things to tackle.
>

my code generator has been working for 3 years now...

the whole point of all of this would be if other people can/should use
textual assemblers rather than raw machine code.


I guess maybe the issue is some about performance, and maybe 20 or 50MB/s
would be needed for it to be "fast enough"...


>> > Another question is why you would be going through assembly code at
>> > all.
>> > What benefit does it provide, compared to, for example, generating
>> > machine
>> > code directly? Surely, if speed is a concern, you could benefit from
>> > cutting out the assembler altogether.
>>
>> producing it directly is IMO a much less nice option.
>> it is barely even a workable strategy with typical bytecode formats, and
>> with x86 machine code would probably suck...
>
> I don't really see that. The way I see it, most of the work is in getting
> from what you have (presumably some instruction-set-independent source
> code
> or intermediate representation) to the instructions of your target
> platform.
> Once you are there, I think emitting these instructions as binary or as
> text doesn't make too much of a difference.

well, one has different code to emit either one.

for textual ASM, it is a huge mass of "print" statements.
for raw machine code, likely it would be a mass of "*ct++=0xB8;" or similar.


API-driven assemblers are sort of middle-ground.

fooasm_mov_regreg(fooasm_eax, fooasm_ecx);
...


> I've written code to emit binary instructions for various targets, and,
> in my experience, it's not very hard. Sure, x86's ModRM is a bit tricky,
> but you write that once and then it will just sit there, doing its job.
> In the grand scheme of writing a compiler, this isn't a big deal.

it is not "tricky", it is tedious and it is nasty...

by the time one writes a function to handle ModRM for them, they will be
tempted to write a function for REX, and maybe for the opcodes, and soon
enough they are on their way to having an assembler...
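for reference, a minimal sketch of the ModRM + REX step being discussed
(function name hypothetical; this covers only the register-direct case of
"mov r/m64, r64", opcode 0x89 /r):

```c
#include <stdint.h>
#include <stddef.h>

/* encode "mov rm, reg" for 64-bit GPRs numbered 0..15 (RAX=0 ... R15=15);
 * returns the number of bytes written */
size_t emit_mov_reg64_reg64(uint8_t *out, unsigned rm, unsigned reg)
{
    /* REX: 0100WRXB -- W=1 for 64-bit, R extends 'reg', B extends 'rm' */
    out[0] = 0x40 | (1 << 3) | ((reg >> 3) << 2) | (rm >> 3);
    out[1] = 0x89;  /* mov r/m64, r64 */
    /* ModRM: mod=11 (register direct), then the low 3 bits of each */
    out[2] = 0xC0 | ((reg & 7) << 3) | (rm & 7);
    return 3;
}
```

and indeed, once this function exists, generalizing it into tables is exactly
the slippery slope toward a full assembler.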


> Generating the instructions in binary form right away also makes it very
> easy to know exactly where your code ends up and what its size is, which
> may actually make it _easier_ to patch addresses into your code and
> make decisions about short vs. long jumps.

well, a very simple strategy works well enough: "jmp foo".

and the assembler figures out whether a long or short jump is needed...
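a minimal sketch of that decision (assuming the final addresses are already
known; in reality, picking a short form can move later labels, which is why
this needs multiple passes):

```c
#include <stdint.h>
#include <stddef.h>

/* emit "jmp target" at address 'addr', choosing rel8 vs rel32 */
size_t emit_jmp(uint8_t *out, int32_t addr, int32_t target)
{
    int32_t disp8 = target - (addr + 2);        /* EB rel8 is 2 bytes long */
    if (disp8 >= -128 && disp8 <= 127) {
        out[0] = 0xEB;                          /* jmp short */
        out[1] = (uint8_t)disp8;
        return 2;
    }
    int32_t disp32 = target - (addr + 5);       /* E9 rel32 is 5 bytes long */
    out[0] = 0xE9;                              /* jmp near */
    out[1] = (uint8_t)disp32;
    out[2] = (uint8_t)(disp32 >> 8);
    out[3] = (uint8_t)(disp32 >> 16);
    out[4] = (uint8_t)(disp32 >> 24);
    return 5;
}
```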


>> admittedly, if really needed I could add a binary-ASM API to my assembler
>> (would allow using function calls to generate ASM), but this is likely to
>> be
>> much less nice than using a textual interface, and could not likely
>> optimize
>> jumps (likely short jumps would need to be explicit).
>
> My experience is that how nice the API is depends very much on the
> language you express it in. For example, I've tried to come up with a nice
> API for instruction generation in C, but never got it to the point where
> I was really happy with it. In a language which lets you write out a
> data structure in-line, preferably with automatic memory management and
> namespaces, this is much easier.

C is assumed here...


> It's the difference between, for example:
>
> n += cg_x86_emit_reg32_imm8_instr(code + n,
> sizeof(code) - n,
> CG_X86_OP_OR,
> CG_X86_REG_EBX,
> 42);
>
> and
>
> (emit code '(or (reg ebx) (imm 42)))
>

yep.

my assembler's original binary API wasn't too much different from the
above...

it was so horrible that originally I wrote the parser mostly to wrap these
horrid-looking API calls...


digging around, I eventually found some old code of mine (from Jan 2007)
targeting this original API:

ASM_EmitLabel(ctx, "$incref");
ASM_OutOpRegImm(ctx, ASM_OP_MOV, ASM_EAX, (int)(&BS1_GC_IncRef));
ASM_OutOpReg(ctx, ASM_OP_CALL, ASM_EAX);
ASM_OutOpSingle(ctx, ASM_OP_RET);

ASM_EmitLabel(ctx, "$decref");
ASM_OutOpRegImm(ctx, ASM_OP_MOV, ASM_EAX, (int)(&BS1_GC_DecRef));
ASM_OutOpReg(ctx, ASM_OP_CALL, ASM_EAX);
ASM_OutOpSingle(ctx, ASM_OP_RET);

ASM_EmitLabel(ctx, "$incref_eax");
ASM_OutOpReg(ctx, ASM_OP_PUSH, ASM_EAX);
ASM_OutOpRegImm(ctx, ASM_OP_MOV, ASM_EAX, (int)(&BS1_GC_IncRef));
ASM_OutOpReg(ctx, ASM_OP_CALL, ASM_EAX);
ASM_OutOpReg(ctx, ASM_OP_POP, ASM_EAX);
ASM_OutOpSingle(ctx, ASM_OP_RET);

ASM_EmitLabel(ctx, "$decref_eax");
ASM_OutOpReg(ctx, ASM_OP_PUSH, ASM_EAX);
ASM_OutOpRegImm(ctx, ASM_OP_MOV, ASM_EAX, (int)(&BS1_GC_DecRef));
ASM_OutOpReg(ctx, ASM_OP_CALL, ASM_EAX);
ASM_OutOpReg(ctx, ASM_OP_POP, ASM_EAX);
ASM_OutOpSingle(ctx, ASM_OP_RET);

Alexei A. Frounze

Mar 26, 2010, 1:40:28 AM
On Mar 25, 1:19 pm, Robbert Haarman <comp.lang.m...@inglorion.net>
wrote:
...

> It's the difference between, for example:
>
>   n += cg_x86_emit_reg32_imm8_instr(code + n,
>                                     sizeof(code) - n,
>                                     CG_X86_OP_OR,
>                                     CG_X86_REG_EBX,
>                                     42);
>
> and
>
>   (emit code '(or (reg ebx) (imm 42)))

Umm... Looks Lispy! :)

For fun I once implemented an x86 assembler (NASM-ish, but with much
less functionality) in Perl. It was pretty compact (~50KB of source
code). A C solution would've been much bigger; the performance relationship
would've been the opposite. Which is, nonetheless, to say: domain-specific
or task-oriented languages are a good thing.

Alex

octavio

Mar 26, 2010, 8:18:30 AM
I use something similar to JIT in 'octaos' and it is fast enough even on
older computers. Instead of using an intermediate language or binary
format, it works directly with sources.
'octasm' can assemble about 1 million instructions with a 1.6 GHz Atom
CPU, but typical programs have just a few thousand instructions, because
the operating system provides a library that does most of the work.
I don't like to use MB or number of lines as a measure, since it depends
on the programming style; comments, long names, or empty lines don't
slow things down very much. Also, the multimedia data that many applications
include does not count, since it would take the same time to load with
an executable file. Well-written programs should never need more than 1
million instructions.
Parsing case-insensitive sources should not be a big problem in your
assembler; just do a table lookup to obtain the uppercase char and the
token type.
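that table-lookup idea can be sketched like this (names hypothetical; one
256-entry table yields both the uppercased character and a crude token
class in a single lookup):

```c
#include <stdint.h>

enum { C_OTHER, C_ALPHA, C_DIGIT, C_SPACE };

typedef struct { uint8_t upper; uint8_t cls; } CharInfo;

CharInfo char_tab[256];

void init_char_tab(void)
{
    for (int c = 0; c < 256; c++) {
        char_tab[c].upper = (uint8_t)c;     /* default: unchanged, C_OTHER */
        char_tab[c].cls   = C_OTHER;
        if (c >= 'a' && c <= 'z') {
            char_tab[c].upper = (uint8_t)(c - ('a' - 'A'));
            char_tab[c].cls   = C_ALPHA;
        } else if (c >= 'A' && c <= 'Z') {
            char_tab[c].cls = C_ALPHA;
        } else if (c >= '0' && c <= '9') {
            char_tab[c].cls = C_DIGIT;
        } else if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            char_tab[c].cls = C_SPACE;
        }
    }
}

/* uppercasing each byte via the table makes case-insensitive scanning
 * cost no more than case-sensitive scanning */
uint8_t to_upper_fast(uint8_t c) { return char_tab[c].upper; }
```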

BGB / cr88192

Mar 26, 2010, 1:21:13 PM

"octavio" <octavio.veg...@gmail.com> wrote in message
news:5327089e-81c5-4a6e...@d27g2000yqf.googlegroups.com...

I was using MB/s for the ASM code mostly as it is easy to calculate.

the fragment I am testing is essentially comment-free, and has very few
empty lines (mostly, it is a glob of code from my main codegen implementing
a lot of basic operations for 128-bit integer values).

anyways, I currently have the thing assembling code at around 2.2 MB/s...

(currently this means re-assembling my blob of ASM code around 3000 times in
10s).

checking lines, 280 lines are currently in use, so, 840000 lines in 10s, or
84000 loc/s.
this means at present, ~11.9us per line.

maybe better if I can get the time per loc a little lower...

note that my blurb does multiple opcodes per-line, since my assembler
supports this.
splitting out to a single opcode per line produces ~500 lines, meaning
1500000 lines in 10s, or 150000 lines per second, or ~ 6.7us per
opcode/line.


syntax is presently still mostly case-insensitive (it is case-sensitive for a
few things, but stricmp is no longer high on the profiler list).


BGB / cr88192

Mar 26, 2010, 1:30:26 PM

"Alexei A. Frounze" <alexf...@gmail.com> wrote in message
news:7bb8d1d3-5ea4-4804...@f14g2000pre.googlegroups.com...

On Mar 25, 1:19 pm, Robbert Haarman <comp.lang.m...@inglorion.net>
wrote:
...
> It's the difference between, for example:
>
> n += cg_x86_emit_reg32_imm8_instr(code + n,
> sizeof(code) - n,
> CG_X86_OP_OR,
> CG_X86_REG_EBX,
> 42);
>
> and
>
> (emit code '(or (reg ebx) (imm 42)))

<--
Umm... Looks Lispy! :)

For fun I've once implemented an x86 assembler (NASMish, but with much
less functionality) in Perl. It was pretty compact (~50KB of source
code). A C solution would've been much bigger. The perf relationship
would've been the opposite. Which is, nonetheless, to say, domain
specific or task oriented languages are a good thing.

-->

the main assembler machinery is about 100kB of C source (parser + opcode
generating logic).

20kB is used for the COFF writer, and 120kB for the opcode-tables
(mechanically-generated C).

the whole thing is a bit larger though if everything else were counted (the
linker, disassembler, a lot of special-purpose logic code, ...).

Rugxulo

Mar 26, 2010, 1:54:24 PM
Hi,

On Mar 25, 10:04 pm, "BGB / cr88192" <cr88...@hotmail.com> wrote:
>
> "Rod Pemberton" <do_not_h...@havenone.cmm> wrote in message


>
> > Is TCC when used as TCCBOOT fast enough in a JIT context? ! ? ! ...
>
> can't say, I have not used tcc.
> I hear it compiles fairly fast though.

It's one-pass, with a built-in assembler and linker, and it optimizes less
than GCC, so that's why. (Although, honestly, Fabrice Bellard
deserves most of the credit.)

Octasm is similarly fast because it's written in itself by a smart
programmer (hi !) and is very cautious about multiple passes. FASM's
author was very very glad to receive tips from Octavio concerning
this. I think he called it the "best suggestion ever" (and he's no
slouch either).

Sorry, can't find that link, but here's when Privalov started speeding
it up, circa 1.50 or such (maybe that'll give some good ideas):

http://board.flatassembler.net/topic.php?t=854

BGB / cr88192

Mar 26, 2010, 3:24:18 PM

"Rugxulo" <rug...@gmail.com> wrote in message
news:5f74e08b-94ea-4648...@i25g2000yqm.googlegroups.com...
Hi,

On Mar 25, 10:04 pm, "BGB / cr88192" <cr88...@hotmail.com> wrote:
>
> "Rod Pemberton" <do_not_h...@havenone.cmm> wrote in message
>
> > Is TCC when used as TCCBOOT fast enough in a JIT context? ! ? ! ...
>
> can't say, I have not used tcc.
> I hear it compiles fairly fast though.

<--


It's one pass, built-in assembler and linker, and its optimizations
are less than GCC, so that's why. (Although, honestly, Fabrice Bellard
deserves most of the credit.)

-->

yeah.
forcing my assembler into single-pass mode effectively doubles its speed
(but disables automatic jump optimization).

so, it is currently 2.55 MB/s with multi-passes allowed, and 4.9 MB/s
single-pass.
(I have spent a lot of the morning fiddly micro-optimizing the damn
thing...).

this puts it at currently about 3us per opcode (323817 opcodes/sec).


so, I may add an optional "fast" mode which will, among other things:
disable multi-pass assembly (short jumps would need to be explicit);
disable the preprocessor;
...


<--


Octasm is similarly fast because it's written in itself by a smart
programmer (hi !) and is very cautious about multiple passes. FASM's
author was very very glad to receive tips from Octavio concerning
this. I think he called it the "best suggestion ever" (and he's no
slouch either).

-->

yep.


<--


Sorry, can't find that link, but here's when Privalov started speeding
it up, circa 1.50 or such (maybe that'll give some good ideas):

http://board.flatassembler.net/topic.php?t=854
-->

yes, ok.

BGB / cr88192

Mar 28, 2010, 1:53:21 AM

"cr88192" <cr8...@hotmail.com> wrote in message
news:hofus4$a0r$1...@news.albasani.net...
> well, this was a recent argument on comp.compilers, but I figured it may
> make some sense in a "freer" context.
>

well, a status update:
1.94 MB/s is the speed which can be gained with "normal" operation (textual
interface, preprocessor, jump optimization, ...);
5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor and
forces single-pass assembly.


10MB/s (analogue) can be gained by using a direct binary interface (newly
added).
in the case of this mode, most of the profile time goes into a few predicate
functions, and also the function for emitting opcode bytes. somehow, I don't
think it is likely to be getting that much faster.

stated another way: 643073 opcodes/second, or about 1.56us/op.
calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
2.31 GHz).


basically, I have a personal optimization heuristic:
when the top item reported by the profiler is the entry point to a switch
statement, it is not likely that all that many more optimizations are to be
gained (the so-called "switch limit"). a variant of this has happened in this case.


in the binary mode, the test fragment is pre-parsed into an array of
struct-pointers, and these structs are used to drive the assembler internals
(with pre-resolved opcode numbers, ...).
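the general shape of such an interface might be sketched like this (all
types and opcode numbers hypothetical, and far simpler than the real thing):
pre-parsing resolves each line to a struct once, and re-assembly just walks
the array:

```c
#include <stdint.h>
#include <stddef.h>

enum { OP_RET, OP_INT3 };           /* pre-resolved opcode numbers */

typedef struct { int op; } AsmOp;   /* one pre-parsed instruction */

size_t emit_op(uint8_t *out, const AsmOp *o)
{
    switch (o->op) {
    case OP_RET:  out[0] = 0xC3; return 1;  /* ret */
    case OP_INT3: out[0] = 0xCC; return 1;  /* int3 */
    }
    return 0;
}

/* driving the emitters from the struct array skips parsing entirely */
size_t assemble_ops(uint8_t *out, const AsmOp *ops, size_t n_ops)
{
    size_t n = 0;
    for (size_t i = 0; i < n_ops; i++)
        n += emit_op(out + n, &ops[i]);
    return n;
}
```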

the fragment has 462 ops and manages to be re-assembled 41758 times before
the timer expires (timer expire is 30s, so 1391 re-assembles/second).

to get any faster would likely involve sidestepping the assembler as well
(such as using a big switch and emitting bytes), but this is not something I
am going to test (would make about as much sense as benchmarking it against
memcpy or similar, since yes, memcpy is faster, but no, it is not an
assembler...).


so, at the moment, this means an approx 5x speed difference between the
fastest and the slowest modes.

I am not really sure if this is all that drastic of a difference...


or such...


Rod Pemberton

Mar 28, 2010, 3:16:54 AM
"BGB / cr88192" <cr8...@hotmail.com> wrote in message
news:homqsi$s25$1...@news.albasani.net...
> [...]

> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> added).
> in the case of this mode, most of the profile time goes into a few
predicate
> functions, and also the function for emitting opcode bytes. somehow, I
don't
> think it is likely to be getting that much faster.
>

A few years ago, I posted the link below for large single-file programs
(talking to you...). I'm not sure if you ever looked at their file sizes, but
the largest two were gcc as a single file and an ogg encoder as a single
file, at 3.2MB and 1.7MB respectively. Those are probably the largest
single file C programs you'll see. It's possible, even likely, some
multi-file project, say the Linux kernel etc., is larger. But, 10MB/s
should still be very good for most uses. But, there's no reason to stop
there, if you've got the time!

http://people.csail.mit.edu/smcc/projects/single-file-programs/

> stated another way: 643073 opcodes/second, or about 1.56us/op.
> calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU
=
> 2.31 GHz).

BTW, what brand of cpu, and what number of cores are being used?

> to get any faster would likely involve sidestepping the assembler as well
> (such as using a big switch and emitting bytes), but this is not something
I
> am going to test (would make about as much sense as benchmarking it
against
> memcpy or similar, since yes, memcpy is faster, but no, it is not an
> assembler...).

OpenWatcom is (or was) one of the fastest C compilers I've used. It skipped
emitting assembly. Given the speed, I'm sure they did much more than
that... It might provide a reference point for a speed comparison. I
haven't used more recent versions (I'm using v1.3). So, I'm assuming the
speed is still there.


Rod Pemberton


Robbert Haarman

Mar 28, 2010, 3:41:38 AM
On Sat, Mar 27, 2010 at 10:53:21PM -0700, BGB / cr88192 wrote:
>
> "cr88192" <cr8...@hotmail.com> wrote in message
> news:hofus4$a0r$1...@news.albasani.net...
>
> well, a status update:
> 1.94 MB/s is the speed which can be gained with "normal" operation (textual
> interface, preprocessor, jump optimization, ...);
> 5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor and
> forces single-pass assembly.
>
>
> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> added).
> in the case of this mode, most of the profile time goes into a few predicate
> functions, and also the function for emitting opcode bytes. somehow, I don't
> think it is likely to be getting that much faster.
>
> stated another way: 643073 opcodes/second, or about 1.56us/op.
> calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
> 2.31 GHz).

To provide another data point:

First, some data from /proc/cpuinfo:

model name : AMD Athlon(tm) Dual Core Processor 5050e
cpu MHz : 2600.000
cache size : 512 KB
bogomips : 5210.11

I did a quick test using the Alchemist code generation library. The
instruction sequence I generated is:

00000000 33C0 xor eax,eax
00000002 40 inc eax
00000003 33DB xor ebx,ebx
00000005 83CB2A or ebx,byte +0x2a
00000008 CD80 int 0x80

for a total of 10 bytes. Doing this 100000000 (a hundred million) times
takes about 4.7 seconds.

Using the same metrics that you provided, that is:

About 200 MB/s
About 100 million opcodes generated per second
About 24 CPU clock cycles per opcode generated

Cheers,

Bob

Rod Pemberton

Mar 28, 2010, 4:22:48 AM
"Robbert Haarman" <comp.la...@inglorion.net> wrote in message
news:2010032807...@yoda.inglorion.net...

>
> First, some data from /proc/cpuinfo:
>
> model name : AMD Athlon(tm) Dual Core Processor 5050e
> cpu MHz : 2600.000
> cache size : 512 KB
> bogomips : 5210.11
>

Unrelated FYI, your BogoMips should be twice that for that CPU. I suspect
you listed it for _one_ core, as /proc/cpuinfo does. Look in
/var/log/messages to see if your total is twice that. It should say both cores
are activated and list the total. I'm really not sure what anyone could use
BogoMips for...


Rod Pemberton


Branimir Maksimovic

Mar 28, 2010, 4:58:29 AM

Well, actually Linux shows BogoMips depending on the BIOS figures, not the
real ones. For example, if you set a 400 MHz FSB and a multiplier of 8,
it will not show 3.2 GHz but 3.6 GHz, if your max multiplier is 9.
For the same reason, if you set a 400 MHz FSB with auto multiplier
and SpeedStep enabled, it will show 2 GHz when the multiplier is 6
and 3 GHz when the multiplier is 9, but the actual clock is
2.4 GHz/3.6 GHz, not 2 GHz/3 GHz as shown.

Robbert Haarman

Mar 28, 2010, 5:49:04 AM
Hi Rod,

On Sun, Mar 28, 2010 at 04:22:48AM -0400, Rod Pemberton wrote:
> "Robbert Haarman" <comp.la...@inglorion.net> wrote in message
> news:2010032807...@yoda.inglorion.net...
> >
> > First, some data from /proc/cpuinfo:
> >
> > model name : AMD Athlon(tm) Dual Core Processor 5050e
> > cpu MHz : 2600.000
> > cache size : 512 KB
> > bogomips : 5210.11
> >
>
> Unrelated FYI, your BogoMips should be twice that for that cpu. I suspect
> you listed it for _one_ core, as /proc/cpuinfo does.

Yes. These lines are taken from /proc/cpuinfo, and are for one of the two
cores. The BogoMIPS rating for both cores taken together is indeed twice
that.

Note that the benchmark I ran only uses a single core. I also performed
my calculations as if there was only a single core. That is, the 24 cycles
per generated instruction are those of the core generating the code;
cycles of the core that is sitting idle are not taken into account.

Alchemist is not currently thread-safe, because of two pieces of global
state: a mode which can be set to 32-bit or 16-bit, and an error variable.
It would not be hard to make this state thread-local, or indeed to change the
interface so that global state is eliminated entirely, but I am not currently
working on Alchemist anymore, and even if I were, this change wouldn't be
very high on my priority list.

Regards,

Bob


BGB / cr88192

Mar 28, 2010, 11:25:45 AM

"Rod Pemberton" <do_no...@havenone.cmm> wrote in message
news:homvnn$f7j$1...@speranza.aioe.org...

> "BGB / cr88192" <cr8...@hotmail.com> wrote in message
> news:homqsi$s25$1...@news.albasani.net...
>> [...]
>> 10MB/s (analogue) can be gained by using a direct binary interface (newly
>> added).
>> in the case of this mode, most of the profile time goes into a few
> predicate
>> functions, and also the function for emitting opcode bytes. somehow, I
> don't
>> think it is likely to be getting that much faster.
>>
>
> A few years ago, I posted the link below for large single file programs
> (talking to you...). I'm not sure if you ever looked their file sizes,
> but
> the largest two were gcc as a single file and an ogg encoder as a single
> file, at 3.2MB and 1.7MB respectively. Those are probably the largest
> single file C programs you'll see. It's possible, even likely, some
> multi-file project, say the Linux kernel etc., is larger. But, 10MB/s
> should still be very good for most uses. But, there's no reason to stop
> there, if you've got the time!
>
> http://people.csail.mit.edu/smcc/projects/single-file-programs/
>

now that I am reminded, I remember them some, but not much...


>> stated another way: 643073 opcodes/second, or about 1.56us/op.
>> calculating from CPU speed, this is around 3604 clock cycles / opcode
>> (CPU
> =
>> 2.31 GHz).
>
> BTW, what brand of cpu, and what number of cores are being used?
>

AMD Athlon 64 X2 4400.
however, all this runs in a single thread, so the number of cores doesn't
affect much.


internally, it runs at 2.31 GHz I think, and this becomes more notable when
doing some types of benchmarks.

my newer laptop has a Pentium 4M or similar, and outperforms my main
computer for raw computational tasks, but comes with rather lame video HW
(and so still can't really play any games much newer than HL2, which runs
similarly well on my old laptop, despite my old laptop being much slower in
general...).

>> to get any faster would likely involve sidestepping the assembler as well
>> (such as using a big switch and emitting bytes), but this is not
>> something
> I
>> am going to test (would make about as much sense as benchmarking it
> against
>> memcpy or similar, since yes, memcpy is faster, but no, it is not an
>> assembler...).
>
> OpenWatcom is (or was) one of the fastest C compilers I've used. It
> skipped
> emitting assembly. Given the speed, I'm sure they did much more than
> that... It might provide a reference point for a speed comparison. I
> haven't used more recent versions (I'm using v1.3). So, I'm assuming the
> speed is still there.
>

well, all this is for my assembler (written in C), but it assembles ASM
code.

note that my struct-array interface doesn't currently implement all the
features of the assembler.

BGB / cr88192

Mar 28, 2010, 12:07:05 PM

"Robbert Haarman" <comp.la...@inglorion.net> wrote in message
news:2010032807...@yoda.inglorion.net...

well, that is actually a faster processor than I am using...


> I did a quick test using the Alchemist code generation library. The
> instruction sequence I generated is:
>
> 00000000 33C0 xor eax,eax
> 00000002 40 inc eax
> 00000003 33DB xor ebx,ebx
> 00000005 83CB2A or ebx,byte +0x2a
> 00000008 CD80 int 0x80
>
> for a total of 10 bytes. Doing this 100000000 (a hundred million) times
> takes about 4.7 seconds.
>

I don't know the bytes output, I was measuring bytes of textual-ASM input:
"num_loops * strlen(input);" essentially.

in the structs-array case, I pre-parsed the example, but continued to
measure against this sample (as-if it were still being assembled each time).


> Using the same metrics that you provided, that is:
>
> About 200 MB/s
> About 100 million opcodes generated per second
> About 24 CPU clock cycles per opcode generated
>


yeah, but they are probably doing something differently.

I found an "alchemist code generator", but it is a commercial app which
processes XML and uses an IDE, so maybe not the one you are referencing
(seems unlikely).

my lib is written in C, and as a general rule has not been "micro-tuned for
max performance" or anything like this (and also is built with MSVC, with
debug settings).

I have been generally performance-tuning a lot of the logic, but not
actually changing much of its overall workings (since notable structural
changes would risk breaking the thing).


mine also still goes through most of the internal logic of the assembler,
mostly bypassing the front-end parser and using pre-resolved opcode numbers
and similar.

emitting each byte is still a function call, and may check for things like
the need to expand the buffer, ...
the output is still packaged into COFF objects (though little related to
COFF is all that notable on the profiler).

similarly, the logic for encoding the actual instructions is still
ASCII-character-driven-logic (it loops over a string, using characters to
give commands such as where the various prefixes would go, where REX goes,
when to place the ModRM bytes, ...). actually, the logic is driven by an
expanded form of the notation from the Intel docs...

there is very little per-instruction logic (such as instruction-specific
emitters), since this is ugly and would have made the thing larger and more
complicated (but, granted, it would have been technically faster).
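the character-driven structure might be sketched very roughly like this
(the template characters here are entirely hypothetical, nothing like the
real tables): each character of a template string commands one encoding step:

```c
#include <stdint.h>
#include <stddef.h>

/* 'o' = emit the opcode byte, 'r' = emit a register-direct ModRM byte */
size_t emit_from_template(uint8_t *out, const char *tmpl,
                          uint8_t opcode, unsigned reg, unsigned rm)
{
    size_t n = 0;
    for (const char *s = tmpl; *s; s++) {
        switch (*s) {
        case 'o':
            out[n++] = opcode;
            break;
        case 'r':
            out[n++] = 0xC0 | ((reg & 7) << 3) | (rm & 7);
            break;
        }
    }
    return n;
}
```

the appeal is that one emitter plus a table of template strings covers the
whole instruction set; the cost is that the central switch tends to dominate
the profile.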

hence, why I say this is a case of the "switch limit", which often causes a
problem for interpreters:
most of the top places currently in the profiler are switch statements...

this ASCII-driven-logic is actually the core structure of the assembler, and
so is not really removable. otherwise my tool for writing parts of my
assembler for me would have to be much more complicated (stuff is generated
from the listings, which tell about things like how the instructions are
structured, what registers exist, ...).


actually, a lot of places in my framework are based around ASCII-driven
logic (strings are used, with characters used to drive particular actions in
particular pieces of code, typically via switch statements).

this would include my x86 interpreter, which reached about 1/70th native
speed.


but, hell, people would probably really like my C compiler upper-end, as
this is essentially a huge mass of XML-processing code... (although no XSLT;
instead, mostly masses of C code which recognize specific forms and work
with them...).


Branimir Maksimovic

Mar 28, 2010, 1:14:33 PM

Well, I measured a quad-core Xeon against a dual-core Athlon slower
than yours, initializing 256 MB of RAM: 4 threads on the Xeon, 2 threads
on the Athlon, same speed.
The point is that it was the same speed with the strongest 3.2 GHz dual-core
Athlon as well. Intel models with an external memory controller are slower
with memory than Athlons. You need to overclock to at least a 400 MHz FSB
to compete with Athlons.

Greets

Robbert Haarman

Mar 28, 2010, 1:37:13 PM
to BGB / cr88192
Hi cr,

On Sun, Mar 28, 2010 at 09:07:05AM -0700, BGB / cr88192 wrote:
>
> "Robbert Haarman" <comp.la...@inglorion.net> wrote in message
> news:2010032807...@yoda.inglorion.net...
> > On Sat, Mar 27, 2010 at 10:53:21PM -0700, BGB / cr88192 wrote:
> >>
> >> "cr88192" <cr8...@hotmail.com> wrote in message
> >> news:hofus4$a0r$1...@news.albasani.net...
> >>

> >> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> >> added).
> >> in the case of this mode, most of the profile time goes into a few
> >> predicate
> >> functions, and also the function for emitting opcode bytes. somehow, I
> >> don't
> >> think it is likely to be getting that much faster.
> >>
> >> stated another way: 643073 opcodes/second, or about 1.56us/op.
> >> calculating from CPU speed, this is around 3604 clock cycles / opcode
> >> (CPU =
> >> 2.31 GHz).
> >
> > To provide another data point:
> >
> > First, some data from /proc/cpuinfo:
> >
> > model name : AMD Athlon(tm) Dual Core Processor 5050e
> > cpu MHz : 2600.000
> > cache size : 512 KB
> > bogomips : 5210.11
> >
>
> well, that is actually a faster processor than I am using...

Yes, it is. That's why I posted it. I am sure the results I got aren't
directly comparable to yours, and the different CPU is one of the reasons.

> > I did a quick test using the Alchemist code generation library. The
> > instruction sequence I generated is:
> >
> > 00000000 33C0 xor eax,eax
> > 00000002 40 inc eax
> > 00000003 33DB xor ebx,ebx
> > 00000005 83CB2A or ebx,byte +0x2a
> > 00000008 CD80 int 0x80
> >
> > for a total of 10 bytes. Doing this 100000000 (a hundred million) times
> > takes about 4.7 seconds.
> >
>
> I don't know the bytes output, I was measuring bytes of textual-ASM input:
> "num_loops * strlen(input);" essentially.

Oh, I see. I misunderstood you there. I thought you would be measuring
bytes of output, because your input likely wouldn't be the same size for
textual input vs. binary input.

Of course, that makes the MB/s figures we got completely incomparable.
I can't produce MB/s of input assembly code for my measurements, because,
in my case, there is no assembly code being used as input.

> in the structs-array case, I pre-parsed the example, but continued to
> measure against this sample (as-if it were still being assembled each time).

Right. I could, of course, come up with some assembly code corresponding to
the instructions that I'm generating, but I don't see much point to that.
First of all, the size would vary based on how you wrote the assembly code,
and, secondly, I'm not actually processing the assembly code at all, so
I don't think the numbers would be meaningful even as an approximation.

> > Using the same metrics that you provided, that is:
> >
> > About 200 MB/s
> > About 100 million opcodes generated per second
> > About 24 CPU clock cycles per opcode generated
> >
>
>
> yeah, but they are probably doing something differently.

Clearly, with the numbers being so different. :-) The point of posting these
numbers wasn't so much to show that the same thing you are doing can be
done in fewer instructions, but rather to give an idea of how much time
the generation of executable code costs using Alchemist. This is basically
the transition from "I know which instruction I want and which operands
I want to pass to it" to "I have the instruction at this address in memory".
In particular, Alchemist does _not_ parse assembly code, perform I/O,
have a concept of labels, or decide what kind of jump instruction you need.

> I found an "alchemist code generator", but it is a commercial app which
> processes XML and uses an IDE, so maybe not the one you are referencing
> (seems unlikely).

Right. The one I am talking about is at http://libalchemist.sourceforge.net/

> my lib is written in C, and as a general rule has not been "micro-turned for
> max performance" or anything like this (and also is built with MSVC, with
> debug settings).

Right, I forgot to mention my compiler settings. The results I posted
are using gcc 4.4.1-4ubuntu9, with -march=native -pipe -Wall -s -O3
-fPIC. So that's with quite a lot of optimization, although the code for
Alchemist hasn't been optimized for performance at all.

> emitting each byte is still a function call, and may check for things like
> the need to expand the buffer, ...

I expect that this may be costly, especially with debug settings enabled.
Alchemist doesn't make a function call for each byte emitted and doesn't
automatically expand the buffer, but it does perform a range check.

> the output is still packaged into COFF objects (though little related to
> COFF is all that notable on the profiler).

Right. Alchemist doesn't know anything about object file formats. It just
gives you the raw machine code.

> there is very little per-instruction logic (such as instruction-specific
> emitters), since this is ugly and would have made the thing larger and more
> complicated (but, granted, it would have been technically faster).

That may be a major difference, too. Alchemist has different functions for
emitting different kinds of instruction. For reference, the code that
emits the "or ebx,byte +0x2a" instruction above looks like this:

/* or ebx, 42 */
n += cg_x86_emit_reg32_imm8_instr(code + n,
                                  sizeof(code) - n,
                                  CG_X86_OP_OR,
                                  CG_X86_REG_EBX,
                                  42);

There are other functions for emitting code, with names like
cg_x86_emit_reg32_reg32_instr, cg_x86_emit_imm8_instr, etc.

Each of these functions contains a switch statement that looks at the
operation (an int) and then calls an instruction-format-specific function,
substituting the actual x86 opcode for the symbolic constant. A similar
scheme is used to translate the symbolic constant for a register name to
an actual x86 register code.

You can take a look at
http://repo.or.cz/w/alchemist.git/blob/143561d2347d492c570cde96481bac725042186c:/x86/lib/x86.c
for all the gory details, if you like.

Cheers,

Bob


BGB / cr88192

Mar 28, 2010, 1:49:49 PM

"Branimir Maksimovic" <bm...@hotmail.com> wrote in message
news:20100328191433.5cf5b2f1@maxa...

> On Sun, 28 Mar 2010 08:25:45 -0700
> "BGB / cr88192" <cr8...@hotmail.com> wrote:
>

<snip>

>>
>> internally, it runs at 2.31 GHz I think, and this becomes more
>> notable when doing some types of benchmarks.
>>
>> my newer laptop has an Pentium 4M or similar, and outperforms my main
>> computer for raw computational tasks, but comes with rather lame
>> video HW (and so still can't really play any games much newer than
>> HL2, which runs similarly well on my old laptop despite my old laptop
>> being much slower in general...).
>
> Well, measured some quad xeon against dual athlon slower
> than your in initializing 256mb of ram 4 threads xeon, 2 threads
> athlon, same speed.
> Point is that same speed was with 3.2 ghz strongest dual athlon as well.
> Intel external memory controller models are slower with memory
> than athlons. You need to overclock to at least 400mhz FSB to compete
> with athlons.
>

well, whatever the case, my 2009-era laptop with a Pentium 4 outperforms my
2007-era desktop with an Athlon 64 X2, at least for pure CPU tasks.

I haven't really compared them with memory-intensive tasks.

I put DDR-2 PC2-6400 RAM in my desktop, but the BIOS regards it as 5400 (as
does memtest86...).
I don't know what the laptop uses.


for games, the main issue is video HW, as apparently the "Intel Mobile
Video" or whatever isn't exactly good...
main computer has a "Radeon HD 4850".

...

Branimir Maksimovic

Mar 28, 2010, 2:04:01 PM
On Sun, 28 Mar 2010 10:49:49 -0700

"BGB / cr88192" <cr8...@hotmail.com> wrote:

>
> "Branimir Maksimovic" <bm...@hotmail.com> wrote in message
> news:20100328191433.5cf5b2f1@maxa...
> > On Sun, 28 Mar 2010 08:25:45 -0700
> > "BGB / cr88192" <cr8...@hotmail.com> wrote:
> >
>
> <snip>
>
> >>
> >> internally, it runs at 2.31 GHz I think, and this becomes more
> >> notable when doing some types of benchmarks.
> >>
> >> my newer laptop has a Pentium 4M or similar, and outperforms my
> >> main computer for raw computational tasks, but comes with rather
> >> lame video HW (and so still can't really play any games much newer
> >> than HL2, which runs similarly well on my old laptop despite my
> >> old laptop being much slower in general...).
> >
> > Well, I measured a quad Xeon against a dual Athlon (slower than
> > yours) initializing 256 MB of RAM: 4 threads on the Xeon, 2 threads
> > on the Athlon, same speed.
> > The point is that it was the same speed with the strongest 3.2 GHz
> > dual Athlon as well. Intel models with an external memory controller
> > are slower with memory than Athlons. You need to overclock to at
> > least a 400 MHz FSB to compete with Athlons.
> >
>
> well, whatever the case, my 2009-era laptop with a Pentium 4
> outperforms my 2007-era desktop with an Athlon 64 X2, at least for
> pure CPU tasks.
>

Intel Core 2 is much faster than Athlon on CPU tasks, clock for clock,
when data is in cache, but Athlon is faster when
you have to write a lot of data at the same time.
That's why Intel has a larger cache, to compensate for that.
The i7 changed that, as it has an internal memory controller.

Greets!

BGB / cr88192

Mar 28, 2010, 2:22:43 PM

"Robbert Haarman" <ingl...@inglorion.net> wrote in message
news:2010032817...@yoda.inglorion.net...

yep.


>> > I did a quick test using the Alchemist code generation library. The
>> > instruction sequence I generated is:
>> >
>> > 00000000 33C0 xor eax,eax
>> > 00000002 40 inc eax
>> > 00000003 33DB xor ebx,ebx
>> > 00000005 83CB2A or ebx,byte +0x2a
>> > 00000008 CD80 int 0x80
>> >
>> > for a total of 10 bytes. Doing this 100000000 (a hundred million) times
>> > takes about 4.7 seconds.
>> >
>>
>> I don't know the bytes output, I was measuring bytes of textual-ASM
>> input:
>> "num_loops * strlen(input);" essentially.
>
> Oh, I see. I misunderstood you there. I thought you would be measuring
> bytes of output, because your input likely wouldn't be the same size for
> textual input vs. binary input.
>
> Of course, that makes the MB/s figures we got completely incomparable.
> I can't produce MB/s of input assembly code for my measurements, because,
> in my case, there is no assembly code being used as input.
>

yes.

I can't directly produce (meaningful) bytes of output either, since the
output is currently in the form of unlinked COFF objects...

>> in the structs-array case, I pre-parsed the example, but continued to
>> measure against this sample (as-if it were still being assembled each
>> time).
>
> Right. I could, of course, come up with some assembly code corresponding
> to
> the instructions that I'm generating, but I don't see much point to that.
> First of all, the size would vary based on how you wrote the assembly
> code,
> and, secondly, I'm not actually processing the assembly code at all, so
> I don't think the numbers would be meaningful even as an approximation.
>

yep.


>> > Using the same metrics that you provided, that is:
>> >
>> > About 200 MB/s
>> > About 100 million opcodes generated per second
>> > About 24 CPU clock cycles per opcode generated
>> >
>>
>>
>> yeah, but they are probably doing something differently.
>
> Clearly, with the numbers being so different. :-) The point of posting
> these
> numbers wasn't so much to show that the same thing you are doing can be
> done in fewer instructions, but rather to give an idea of how much time
> the generation of executable code costs using Alchemist. This is basically
> the transition from "I know which instruction I want and which operands
> I want to pass to it" to "I have the instruction at this address in
> memory".
> In particular, Alchemist does _not_ parse assembly code, perform I/O,
> have a concept of labels, or decide what kind of jump instruction you
> need.
>

mine does all this apart from the IO.

input and output are passed as buffers, although input can also be done into
the assembler via "print" statements, which are buffered internally; this is
one of the main ways of using the assembler.

a trivial variant is the "puts" command, which doesn't do any formatting,
and hence is a little faster if the code is pre-formed.
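a rough idea of what such a buffered print/puts pair could look like; the struct and function names here are hypothetical, since the thread doesn't show the real interface:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the assembler's internal text buffer. */
typedef struct { char text[4096]; size_t len; } AsmCtx;

/* "puts"-style entry: append pre-formed text, no formatting pass. */
static void asm_puts(AsmCtx *ctx, const char *s)
{
    size_t n = strlen(s);

    if (ctx->len + n < sizeof(ctx->text)) {
        memcpy(ctx->text + ctx->len, s, n);
        ctx->len += n;
        ctx->text[ctx->len] = 0;
    }
}

/* "print"-style entry: run the arguments through vsnprintf first. */
static void asm_print(AsmCtx *ctx, const char *fmt, ...)
{
    char tmp[256];
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(tmp, sizeof(tmp), fmt, ap);
    va_end(ap);
    asm_puts(ctx, tmp);
}
```

when the buffered text is later handed to the assembler proper, the puts path has simply skipped one vsnprintf per statement.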


>> I found an "alchemist code generator", but it is a commercial app which
>> processes XML and uses an IDE, so maybe not the one you are referencing
>> (seems unlikely).
>
> Right. The one I am talking about is at
> http://libalchemist.sourceforge.net/
>

ok.


>> my lib is written in C, and as a general rule has not been "micro-turned
>> for
>> max performance" or anything like this (and also is built with MSVC, with
>> debug settings).
>
> Right, I forgot to mention my compiler settings. The results I posted
> are using gcc 4.4.1-4ubuntu9, with -march-native -pipe -Wall -s -O3
> -fPIC. So that's with quite a lot of optimization, although the code for
> Alchemist hasn't been optimized for performance at all.
>

yeah.

MSVC's performance generally falls behind GCC's in my tests anyways...


>> emitting each byte is still a function call, and may check for things
>> like
>> the need to expand the buffer, ...
>
> I expect that this may be costly, especially with debug settings enabled.
> Alchemist doesn't make a function call for each byte emitted and doesn't
> automatically expand the buffer, but it does perform a range check.
>

the range check is used, and typically realloc is used if the buffer needs
to expand.
the default initial buffers are 4kB and expand by a factor of 1.5, and with
the example I am using this shouldn't be an issue.
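that growth policy (range check per byte, realloc on demand, 4kB initial size, 1.5x growth) can be sketched like this; the struct and names are illustrative, not the assembler's actual code:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative section buffer: starts at 4 kB, grows by 1.5x via realloc. */
typedef struct { unsigned char *data; size_t len, cap; } Section;

static int section_put_byte(Section *s, unsigned char b)
{
    if (s->len >= s->cap) {              /* range check on every byte */
        size_t ncap = s->cap ? s->cap + s->cap / 2 : 4096;
        unsigned char *p = realloc(s->data, ncap);

        if (!p)
            return 0;                    /* out of memory */
        s->data = p;
        s->cap = ncap;
    }
    s->data[s->len++] = b;
    return 1;
}
```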


>> the output is still packaged into COFF objects (though little related to
>> COFF is all that notable on the profiler).
>
> Right. Alchemist doesn't know anything about object file formats. It just
> gives you the raw machine code.

yep, and mine produces objects which will be presumably passed to the
dynamic linker (but other common uses include writing them to files, ...).


my tests have typically excluded the dynamic linker, as it doesn't seem to
figure heavily in the benchmarks, would be difficult to benchmark, and also
tends to crash after relinking the same module into the image more than a
few k times in a row (I suspect it is likely using up too much memory or
similar...).


>> there is very little per-instruction logic (such as instruction-specific
>> emitters), since this is ugly and would have made the thing larger and
>> more
>> complicated (but, granted, it would have been technically faster).
>
> That may be a major difference, too. Alchemist has different functions for
> emitting different kinds of instruction. For reference, the code that
> emits the "or ebx,byte +0x2a" instruction above looks like this:
>
> /* or ebx, 42 */
> n += cg_x86_emit_reg32_imm8_instr(code + n,
> sizeof(code) - n,
> CG_X86_OP_OR,
> CG_X86_REG_EBX,
> 42);
>
> There are other functions for emitting code, with names like
> cg_x86_emit_reg32_reg32_instr, cg_x86_emit_imm8_instr, etc.
>
> Each of these functions contains a switch statement that looks at the
> operation (an int) and then calls an instruction-format-specific function,
> substituting the actual x86 opcode for the symbolic constant. A similar
> scheme is used to translate the symbolic constant for a register name to
> an actual x86 register code.
>
> You can take a look at
> http://repo.or.cz/w/alchemist.git/blob/143561d2347d492c570cde96481bac725042186c:/x86/lib/x86.c
> for all the gory details, if you like.
>

mine works somewhat differently, then.


in my case, the opcode number is used, and then the specific form of the
instruction for the given arguments is looked up (typically using
predicate-based matchers), and this results in a string which tells how to
emit the bytes for the opcode.

this string is passed to the "OutBodyBytes" function, which follows the
commands in the string (typically single letters telling where to put
size/addr/REX/... prefixes, apart for XOP and AVX instructions which are
special and may use several additional characters to define the specific
prefix), and outputs literal bytes (typically represented in the command
string as hex values).

each byte is emitted via "OutByte", which deals with matters of putting the
byte into the correct section, checking if the buffer for that section needs
to expand, ...


or, IOW, it is a more generic assembler...
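the command-string scheme described above can be sketched like so. the command characters here are invented for illustration (the real command language has letters for size/addr/REX prefixes and more): "o" emits a 0x66 operand-size prefix when the flag is set, and a pair of hex digits emits that literal byte:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Interpret an emitter command string, OutBodyBytes-style (illustrative). */
static size_t out_body_bytes(unsigned char *buf, size_t size,
                             const char *cmd, int opsize16)
{
    size_t n = 0;

    while (*cmd && n < size) {
        if (*cmd == ' ') {
            cmd++;                          /* separators are ignored */
        } else if (*cmd == 'o') {
            if (opsize16)
                buf[n++] = 0x66;            /* conditional operand-size prefix */
            cmd++;
        } else if (isxdigit((unsigned char)cmd[0]) &&
                   isxdigit((unsigned char)cmd[1])) {
            char hex[3] = { cmd[0], cmd[1], 0 };

            buf[n++] = (unsigned char)strtol(hex, NULL, 16);
            cmd += 2;                       /* literal opcode byte */
        } else {
            break;                          /* unknown command: stop */
        }
    }
    return n;
}
```

in the real assembler each byte would additionally pass through an OutByte-style function that picks the section and checks whether its buffer needs to grow.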

Waldek Hebisch

Mar 29, 2010, 4:50:51 PM
In comp.lang.misc BGB / cr88192 <cr8...@hotmail.com> wrote:
>
> "cr88192" <cr8...@hotmail.com> wrote in message
> news:hofus4$a0r$1...@news.albasani.net...
> > well, this was a recent argument on comp.compilers, but I figured it may
> > make some sense in a "freer" context.
> >
>
> well, a status update:
> 1.94 MB/s is the speed which can be gained with "normal" operation (textual
> interface, preprocessor, jump optimization, ...);
> 5.28 MB/s can be gained via "fast" mode, which bypasses the preprocessor and
> forces single-pass assembly.
>
>
> 10MB/s (analogue) can be gained by using a direct binary interface (newly
> added).
> in the case of this mode, most of the profile time goes into a few predicate
> functions, and also the function for emitting opcode bytes. somehow, I don't
> think it is likely to be getting that much faster.
>
> stated another way: 643073 opcodes/second, or about 1.56us/op.
> calculating from CPU speed, this is around 3604 clock cycles / opcode (CPU =
> 2.31 GHz).
>

For a little comparison: Poplog needs 0.24s to compile about
20000 lines of high-level code, generating about 2.4 MB of
image. Only part of the generated image is instructions; the rest
is data and relocation info. A conservative estimate is about
10 machine instructions per high-level line, which gives
about 200000 instructions, that is about 800000 instructions
per second.
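Spelled out, the estimate above is 20000 lines x ~10 instructions each, divided by 0.24 s, i.e. roughly 833000 instructions per second, rounded down to "about 800000":

```c
#include <assert.h>

/* The throughput estimate from the post, spelled out. */
static double poplog_instr_per_sec(void)
{
    double lines = 20000.0;         /* high-level lines compiled */
    double instr_per_line = 10.0;   /* conservative estimate */
    double seconds = 0.24;          /* measured compile time */

    return lines * instr_per_line / seconds;   /* ~833333 */
}
```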

Poplog generates machine code from a binary intermediate form
(slightly higher level than assembler; typically one
intermediate operation generates 1-3 machine instructions).
Code is generated in multiple passes, at least two: in the next-
to-last pass the code generator computes the size of the code, then
a buffer of appropriate size is allocated, and in the final pass
the code is emitted to the buffer.
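That measure-then-emit pattern can be sketched as one emitter run twice, first with a NULL buffer to count bytes and then into an exact-size allocation (illustrative code, not Poplog's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Pass 1 (buf == NULL): only count bytes. Pass 2: actually store them. */
static size_t emit(unsigned char *buf)
{
    size_t n = 0;

    if (buf) { buf[n] = 0x31; buf[n + 1] = 0xC0; }   /* xor eax, eax */
    n += 2;
    if (buf) buf[n] = 0xC3;                          /* ret */
    n += 1;
    return n;
}

static unsigned char *codegen(size_t *out_len)
{
    size_t len = emit(NULL);            /* next-to-last pass: compute size */
    unsigned char *buf = malloc(len);   /* allocate exact-size buffer */

    if (buf)
        emit(buf);                      /* final pass: emit into buffer */
    *out_len = len;
    return buf;
}
```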

The code generator cannot generate arbitrary x86 instructions,
just the ones needed to express intermediate operations.
Bytes are emitted via function calls; opcodes and modes
are symbolic constants (textual in source, but integers
in compiled form).

My feeling is that trying to use strings as intermediate form
(or even "character based dispatch") would significantly
slow down code generator and the whole compiler.

BTW: I tried this on 2.4 GHz Core 2. The machine is quad
core, but Poplog uses only one. L2 cache is 4MB per two cores
(one pair of cores shares one cache on one die, another pair
of cores is on second die and has its own cache). IME Core 2
is significantly (about 20-30% faster than similarly clocked
Athlon 64 (I have no comparison with newer AMD processors)),
so the results are not directly comparable with yours.

--
Waldek Hebisch
heb...@math.uni.wroc.pl

BGB / cr88192

Mar 29, 2010, 6:54:20 PM

"Waldek Hebisch" <heb...@math.uni.wroc.pl> wrote in message
news:hor3rb$jv2$1...@z-news.wcss.wroc.pl...

ok.


> Poplog generates machine code from binary intermediate form
> (slightly higher level than assembler, typically one
> intermediate operation generates 1-3 machine instructions).
> Code is generated in multiple passes, at least two, in next
> to last pass code generator computes size of code, then
> buffer of appropriate size is allocated and in final pass
> code is emmited to the buffer.
>
> Code generator can not generate arbitrary x86 instructions,
> just the ones needed to express intermediate operations.
> Bytes are emmited via function calls, opcodes and modes
> are symbolic constants (textual in source, but integers
> in compiled form).
>

granted, direct byte-for-byte output is much faster than what I am doing.


> My feeling is that trying to use strings as intermediate form
> (or even "character based dispatch") would significantly
> slow down code generator and the whole compiler.
>

it depends a lot though as to how much of the overall time would actually go
into this.
text is a lot more expensive in cases where little else is going on, but is
a bit cheaper in cases where there is a large amount of logic code in the
mix.

in the case of an assembler though, the amount of internal logic is
comparatively smaller, and so string-processing tasks are overall more
expensive...


but, the bigger question here is not which is faster, but rather which
offers a better set of tradeoffs.

direct binary APIs tend to be far less generic than an assembler, for
example, they will be specialized to a particular code generator, ... and so
not as useful for general-purpose tasks (say, multiple code generators using
the same assembler, some input coming from files, ...).


it is much like how XML is not as fast to work with as S-Expressions either,
but XML is more flexible, thus making it more favorable despite its slower
speeds.


> BTW: I tried this on 2.4 GHz Core 2. The machine is quad
> core, but Poplog uses only one. L2 cache is 4MB per two cores
> (one pair of cores shares one cache on one die, another pair
> of cores is on second die and has its own cache). IME Core 2
> is significantly (about 20-30% faster than similarly clocked
> Athlon 64 (I have no comparison with newer AMD processors)),
> so the results are not directly comparable with yours.
>

yeah.

I am not so familiar with Poplog either though.

> --
> Waldek Hebisch
> heb...@math.uni.wroc.pl


Waldek Hebisch

Mar 30, 2010, 5:17:49 PM

Poplog has code generators for x86 (32 and 64 bit), Alpha (64 bit),
M68000, Mips, Sparc, PPC, HPPA (last 4 are 32-bit). Front ends
for Pop11 ("native" language of Poplog), Lisp, SML, Prolog. In
principle the intermediate representation should work for Pascal
(granted, a bit unusual due to garbage collection). C would
pose problems, partly because of pointer arithmetic, but more
because casts between pointers and integers would interact badly
with the garbage collector. Still, I would say that the Poplog interface is
very generic.

Of course, a binary interface is not so great if you want to
create a real assembler (or support inline assembly).

--
Waldek Hebisch
heb...@math.uni.wroc.pl

cr88192

Apr 1, 2010, 11:40:53 AM

"Waldek Hebisch" <heb...@math.uni.wroc.pl> wrote in message
news:hotppt$qro$1...@z-news.wcss.wroc.pl...

> In comp.lang.misc BGB / cr88192 <cr8...@hotmail.com> wrote:
>>

<snip>

>>
>> but, the bigger question here is not which is faster, but rather which
>> offers a better set of tradeoffs.
>>
>> direct binary APIs tend to be far less generic than an assembler, for
>> example, they will be specialized to a particular code generator, ... and
>> so
>> not as useful for general-purpose tasks (say, multiple code generators
>> using
>> the same assembler, some input coming from files, ...).
>>
>
> Poplog has code generators for x86 (32 and 64 bit), Alpha (64 bit),
> M68000, Mips, Sparc, PPC, HPPA (last 4 are 32-bit). Front ends
> for Pop11 ("native" language of Poplog), Lisp, SML, Prolog. In
> principle intermediate representation should work for Pascal
> (granted, a bit unusual due to garbage collection). C would
> pose problems, some because of pointer artihmetic, bigger
> one because casts between pointers and integers would interact badly
> with garbage collector. Still, I would say that Poplog interface is
> very generic.
>

yeah.

my assembler is a generic assembler, and so can more or less handle any sort
of ASM one wants to generate, which includes nearly the entire x86 ISA
(excluding of course some older and system instructions which are unlikely
to be needed, and most of AVX, though mostly because there aren't really
many processors around which support AVX as of yet...).


I don't support other ISA's (such as PPC or ARM), mostly because I don't
have any systems which support them (and I am not going to bother to target
an emulator).

adapting my assembler to PPC wouldn't likely be difficult (new ISA listings,
different opcode-encoding rules, ...).
adapting my compiler to PPC would be a little more work, but not as much
since a lot of generalization already took place when adapting it to x86-64.

ARM is likely to be similar to the PPC case.


> Of course, binary interface is not so great if you want to
> create real assembler (or support inline assembly).
>

the newer binary API for my assembler can also handle textual ASM fragments
as well, although one has to compose them fully beforehand (no printf-style
interface...).

however, at the moment I am less likely to use it as it is less convenient,
and also because my codegens still don't produce large volumes of ASM...


or such...


wolfgang kern

Apr 1, 2010, 2:47:37 PM

"cr88192" wrote

[about..]
I followed this thread with interest...
and I can only add:

never put 'programmer convenience' above
the final client's demands!

Sure, fast (cheap) programming may serve a customer well,
but in daily use of it he may curse you more often
than I curse Bill the Greedy every day.

So I actually won't care about the time to assemble a piece of code
as long as the end user gets what he expects in terms of reliability
and performance. [C++ users and friends may see this differently.]

__
wolfgang


cr88192

Apr 1, 2010, 3:29:48 PM

"wolfgang kern" <now...@never.at> wrote in message
news:hp2prm$ki5$1...@newsreader2.utanet.at...

well, the simple answer for this one is for people not to write apps in
Python...

Python is convenient for some developers, but many of the apps I have had
the most frustration with have been Python apps (poor performance, randomly
failing due to runtime errors, ...).


my assembler, OTOH, is fast enough not to bog down my stuff, and for my uses
this is probably good enough.

simplicity and flexibility is sometimes better than raw performance.

for example, consider how many people write apps in languages such as Java
and C#, which due to the combination of language design and coding practices
almost invariably run slower than would be otherwise possible, but in many
of these cases the performance overhead can be justified due to the improved
maintainability of using these features.

the other extreme is unmaintainable code which may run a slight bit faster,
but this is not necessarily a good tradeoff...

or such...


wolfgang kern

Apr 6, 2010, 3:37:02 PM

"cr88192" <iad:
>> I wrote

>> [about..]
>> I followed this thread with interest...
>> and I can only add:

>> never put 'programmer convenience' above
>> the final client's demands!

>> Sure, fast (cheap) programming may serve a customer well,
>> but in daily use of it he may curse you more often
>> than I curse Bill the Greedy every day.

>> So I actually won't care about the time to assemble a piece of code
>> as long as the end user gets what he expects in terms of reliability
>> and performance. [C++ users and friends may see this differently.]

> well, the simple answer for this one is for people not to write apps in
> Python...

> Python is convenient for some developers, but many of the apps I have had
> the most frustration with have been Python apps (poor performance,
> randomly failing due to runtime errors, ...).

obviously true :)

> my assembler, OTOH, is fast enough not to bog down my stuff, and for my
> uses this is probably good enough.
> simplicity and flexibility is sometimes better than raw performance.
> for example, consider how many people write apps in languages such as Java

> and C#, which due to the combination of language design and coding

There is actually a huge gap between 'working code'
(which is usually heavily bloated) and the truly
machine-oriented way of using the hardware.

> practices almost invariably run slower than would be otherwise possible,
> but in many of these cases the performance overhead can be justified due
> to the improved maintainability of using these features.

The term 'maintainability' (in my point of view) just covers
foreseen bug-recovery, or anything to change because it wasn't known
during the programming phase.

I strictly avoid such situations by accepting the final user's demand
as a given task...

> the other extreme is unmaintainable code which may run a slight bit
> faster, but this is not necessarily a good tradeoff...

[OR SUCH..]

Maintainability of any code is just a matter of documentation/comments,
but my long-term experience with sold software has told me that a working
program won't ever need any service, upgrade, or "maintenance".

btw: never act like M$ and sell BUGS as if they were security issues.

__
wolfgang


Nathan Baker

Apr 8, 2010, 3:13:05 PM
"cr88192" <cr8...@hotmail.com> wrote in message
news:hp2s82$c8o$1...@news.albasani.net...

>
> "wolfgang kern" <now...@never.at> wrote in message
> news:hp2prm$ki5$1...@newsreader2.utanet.at...
>>
>> So I actually wont care the time to assemble a piece of code
>> as long the end-user get what he expect in terms of reliabilty
>> and performance. [C-\+users and friends may see this different].
>>
>
> well, the simple answer for this one is for people not to write apps in
> Python...
>
> Python is convenient for some developers, but many of the apps I have had
> the most frustration with have been Python apps (poor performance,
> randomly failing due to runtime errors, ...).
>

Scripting languages provide beneficial aspects to a specific set of use
cases -- situations where using compiled (or hand-coded ASM) code would be a
ridiculous choice.

That said, I do see a small trend back to a performance-oriented development
approach -- due to the failure of CPU makers to address the issue.

One, Google developed a new 'system programming language' called "Go" which
brings some scripting-language features to the table... among other goodies.

Two, Facebook developed a compiler for a derivative of PHP... and it
outperforms existing JIT implementations.

Nathan.


James Harris

Apr 8, 2010, 4:51:56 PM
On 8 Apr, 20:13, "Nathan Baker" <nathancba...@gmail.com> wrote:

...

> That said, I do see a small trend back to a performance-oriented development
> approach -- due to failure of CPU makers to address the issue.

What have the CPU makers not done?

James

Nathan Baker

Apr 8, 2010, 5:48:45 PM

There was a time when, every few years, the CPU clock speed seemed to be
increasing geometrically...

1 MHz
8 MHz
25 MHz
90 MHz
133 MHz
500 MHz

But when they got into the GHz range, the trend stopped. There is no longer
a "user-noticeable" increase in performance gained by purchasing new
hardware. This places the burden on software.

Nathan.


BGB / cr88192

Apr 8, 2010, 9:44:59 PM

"Nathan Baker" <nathan...@gmail.com> wrote in message
news:hpl9rv$a46$1...@speranza.aioe.org...

> "cr88192" <cr8...@hotmail.com> wrote in message
> news:hp2s82$c8o$1...@news.albasani.net...
>>
>> "wolfgang kern" <now...@never.at> wrote in message
>> news:hp2prm$ki5$1...@newsreader2.utanet.at...
>>>
>>> So I actually wont care the time to assemble a piece of code
>>> as long the end-user get what he expect in terms of reliabilty
>>> and performance. [C-\+users and friends may see this different].
>>>
>>
>> well, the simple answer for this one is for people not to write apps in
>> Python...
>>
>> Python is convenient for some developers, but many of the apps I have had
>> the most frustration with have been Python apps (poor performance,
>> randomly failing due to runtime errors, ...).
>>
>
> Scripting languages provide beneficial aspects to a specific set of use
> cases -- situations where using compiled (or hand-coded ASM) code would be
> a ridiculous choice.
>

fair enough.


> That said, I do see a small trend back to a performance-oriented
> development approach -- due to failure of CPU makers to address the issue.
>

Python is not ONLY slow...
it has a bad tendency to delay far too many problems until runtime, so the
app often randomly dies due to trivial crap like type-check errors or
passing the wrong number of arguments to a method (or a non-existent
method).


the other extreme is using C# in Visual Studio, where typically one sees
warnings and error-messages pop up and disappear in real-time (combined with
underlining, ... like in MS Word).

admittedly, this is something that VS has done well, and very likely
outweighs the relative cost (vs Python) of having to use static types and an
overall extremely fussy compiler...


admittedly, I spent a good deal of effort (earlier today) trying to figure
out how to do the equivalent of "dynamic_cast<>" or "instanceof". in C#,
there is no nifty syntax, and I suspect that likely it is overall a more
expensive operation.

if(typeof(Foo).IsInstanceOfType(obj))
...

vs:
if(obj instanceof Foo)
...


and a method-call vs an opcode (in the JVM), or in my case an internal
operation itself likely to be fairly cheap. admittedly, given that MSIL is
JIT'ed, it is likely the overall cost of a method call is fairly low (vs,
for example, the cost of checking for inheritance, ...).

> One, Google developed a new 'system programming language' called "Go"
> which brings some scripting-language features to the table.. amoung other
> goodies.
>
> Two, Facebook developed a compiler for a derivative of PHP... and it
> outperforms existing JIT implementations.
>

yes, ok.

BGB / cr88192

Apr 8, 2010, 10:11:45 PM

"Nathan Baker" <nathan...@gmail.com> wrote in message
news:hplivq$q2c$1...@speranza.aioe.org...

yep, we can no longer "make up the difference" writing horridly slow code
in horridly slow VMs and expecting the users' PC to be so much faster as to
minimize the performance issues...

now, people writing software actually have to optimize and crap again...


but, there is some improvement to be had:
I can get a new computer with a quad-core processor, rather than my current
one with a dual-core, and also memory bandwidth is increasing, and
hard-drives are getting bigger, ...

admitted, now it is like minor improvements after 3 years, vs massive
improvements over 3 years...


I wonder if the claim that PC's will be powerful enough to simulate a human
consciousness by 2019 still holds?... I sort of doubt it, unless PC's start
really picking up again, even if it could work in the first place.

then again, there was a game from a few years ago, which was set in a year
like 2012 or so, and which looks a little dated now, for example, because
flat-panel displays are still treated as a novelty and with most people
still using CRT's, whereas it is 2010 now and CRT's have apparently largely
disappeared much more quickly than this, ...

making the game appear more to be set in 2007 or 2008, errm... and also as
having been released in that year.
hrrm... they didn't bother with futurism at all...


or such...

James Harris

Apr 9, 2010, 6:10:17 AM

Forgive me but from at least two perspectives that's an extraordinary
viewpoint. I thought you might be referring to compilers and other
development tools.

First, when an industry has progressed geometrically (your term) and
*sustained* that improvement for such a long period of time

http://en.wikipedia.org/wiki/Wheat_and_chessboard_problem

to refer to the "failure of CPU makers to address the issue" says they
have somehow failed. I would call what they have accomplished a
success.

Second, the user-noticeable increase in performance (your term) has
lagged way behind hardware improvements. AIUI the problem is software.
User interfaces and inefficient code continue to swallow up much of
the increases in hardware speeds. So rather than hardware placing the
burden on software, the burden is software.

On topic for at least one of the crossposted groups (comp.lang.misc)
we ought to be able to design a language that can exploit and benefit
from the amazing hardware we programmers now have available.

We've never had it so good! For example, as mentioned at

http://en.wikipedia.org/wiki/Cray-1

one of the early supercomputers

* weighed 5 or 6 tons
* consumed over 100 kilowatts of power
* cost millions
* and had a cycle time of 12.5 ns

That means it ran at a lowly 80 MHz. I know they don't have vector
units etc but even mobile phones clock faster than that now.

James

Robert Redelmeier

Apr 9, 2010, 10:58:23 AM
In alt.lang.asm Nathan Baker <nathan...@gmail.com> wrote in part:

> There was a time when, every few years, the CPU clock speed
> seemed to be increasing geometrically...
>
> 1 MHz > 8 MHz > 25 MHz > 90 MHz > 133 MHz > 500 MHz
>
> But when they got into the GHz range, the trend stopped. There is
> no longer a "user noticable" increase in performance gained by
> purchasing new hardware. This places the burden on software.

Even outside of niches (video compression is still CPU-bound), this
really is not true. The trend stopped mostly because it didn't pay.
CPU time has dwindled to insignificance for common tasks, but other
things (video rendering, disk seek, network response) have not.

People can see significant improvements in some tasks
with appropriate hardware upgrades like video cards and SSDs.
MS-Windows can boot in 10-15s with the latter.

As for fixing software, fast CPUs _reduce_ the cost of locally
suboptimal code. What matters is high-level optimization.

As an example, Firefox can be made to open much quicker if you
disable automatic checking for updates and set it to open on a
blank page. Perhaps Firefox should delay update checking.
This makes a bigger difference than a 2.0 vs 2.5 GHz CPU speed.


-- Robert


BGB / cr88192

Apr 9, 2010, 12:46:43 PM

"Robert Redelmeier" <red...@ev1.net.invalid> wrote in message
news:hpnfaf$qa7$1...@speranza.aioe.org...


this is in part why I don't worry that much that the way I usually use my
assembler is not the fastest possible...
the amount of code I have to feed it before performance matters is
"massive", and by this point I would likely have much bigger concerns.

there is also my metadata database, which I optimized mostly because
DB-loading was bogging down startup times, but during running (or code
compilation) it is not a significant time user (the compiler generally
pulls/sends info to/from its own internal representations).


most time still generally goes into whatever the app is doing, rather than
internal maintenance costs, and so it typically matters more that the
runtime is lightweight, rather than the codegen tasks being "as fast as
possible".

although, granted, my C compiler is still a bit slow to really be used for
some things I had originally imagined (producing C fragments at run-time and
eval'ing them...).


hence, I am changing my more recent strategy to using a "C-enabled"
script-language (BGBScript) for things more eval-related (if I can access
the C toplevel and typesystem without too much hassle, this is looking good
enough...).

the BGBScript route looks much more promising at the moment than the Java or
C# routes.

however, BGBScript still needs a bit more work here:
better integration with the external typesystems (both C and JVM-style);
more solidly defining the object and scoping models;
...

note that the language may end up being mixed-type (using both static and
dynamic type semantics).

var x; //dynamically typed variable
var i:int; //integer variable
var a:int[]; //integer array (likely Java/C# style).
var pi:int*; //pointer-to-integer

but, all this could lead to issues...

note:
x=i; //ok, 'i' converted to a fixint
x=a; //ok, JVM-style arrays convert fine
x=pi; //issue, may need to box pointer

i=x; //issue, needs typecheck
a=x; //issue, needs typecheck
pi=x; //issue, several possible issues

pi=a; //should work ok


in C#, doing stuff like the above needs casts, but a lot may depend on
the "policy" of the language (C# likes to be overly strict, and requires
casts for damn near everything).

BGBScript, being intended mostly as a scripting language, will probably be
far more lax and not balk unless there really is a problem (as in, the
operation either doesn't make sense or can't be safely completed).
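for comparison, Python's gradual typing hits the same cases: a dynamically
typed slot accepts anything, but assigning back into a statically typed slot
needs a runtime typecheck. a minimal sketch of the idea (`assign_checked` is
a hypothetical helper for illustration, not part of BGBScript):

```python
from typing import Any

def assign_checked(value: Any, expected: type):
    """Mimic the 'i=x' case above: dynamic -> static needs a typecheck."""
    if not isinstance(value, expected):
        raise TypeError(f"cannot assign {type(value).__name__} "
                        f"to {expected.__name__}")
    return value

x: Any = 5                          # like 'var x;'  -- dynamically typed
i: int = assign_checked(x, int)     # like 'i=x;'    -- ok, x holds an int
x = [1, 2, 3]                       # like 'x=a;'    -- dynamic side takes anything
# assign_checked("oops", int)      # the "balk" case: raises TypeError
```

the lax-policy idea is then just: only `assign_checked`-style failures are
errors; everything else converts silently.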

unlike in C or C++, types may not be arbitrarily complex declarations.

all types will thus have a fixed form:
<modifiers> <typename> <specifiers>


another issue is whether Prototype OO or Class/Instance OO should be the
"default" model (either way, I am likely to support both models).

P-OO is likely to remain dynamically-typed (statically-typed P-OO is likely
to be more hassle than it is worth).

Class/Instance is likely to allow mixed-typing, and will use a Java/C# style
object model.

...


or such...

Maxim S. Shatskih

unread,
Apr 9, 2010, 10:17:01 PM4/9/10
to
> Python is not ONLY slow...

From what I know on Python, it is a junk language worse than both Perl and PHP.

--
Maxim S. Shatskih
Windows DDK MVP
ma...@storagecraft.com
http://www.storagecraft.com

BGB / cr88192

unread,
Apr 9, 2010, 10:35:52 PM4/9/10
to

"Maxim S. Shatskih" <ma...@storagecraft.com.no.spam> wrote in message
news:hpon2u$2lon$1...@news.mtu.ru...

> Python is not ONLY slow...

<--


From what I know on Python, it is a junk language worse than both Perl and
PHP.
-->

I will not exactly disagree with this.
but, a lot of people like it, for whatever reason...


Nathan Baker

unread,
Apr 9, 2010, 11:37:56 PM4/9/10
to
"James Harris" <james.h...@googlemail.com> wrote in message
news:292e7385-9a2b-4acd...@x12g2000yqx.googlegroups.com...

On 8 Apr, 22:48, "Nathan Baker" <nathancba...@gmail.com> wrote:
>
> Forgive me but from at least two perspectives that's an extraordinary
> viewpoint. I thought you might be referring to compilers and other
> development tools.

Oh, yeah, right. I *did* originally respond to a dev-tool issue, so why did
I introduce hardware into the picture??

Well, maybe it is a 'strawman', but it sure does seem to be a popular
'excuse/justification' for developer's choices on both sides of the aisle.
Examples:

o "Netbook/phone/gadget resources are restricted and are clocked slow,
therefore we are forced to develop using system languages!"
o "Modern desktops are *so* ahead of last decade's and are *so* resource
rich, we'd be fools not to develop using scripting languages!"

Well, software _is_ dependent on hardware, so I think it will forever be a
talking point.

> On topic for at least one of the crossposted groups (comp.lang.misc)
> we ought to be able to design a language that can exploit and benefit
> from the amazing hardware we programmers now have available.

Doesn't http://golang.org/ do that?? If not, in what areas is it lacking?

Nathan.


James Harris

unread,
Apr 10, 2010, 4:22:06 AM4/10/10
to
On 10 Apr, 03:17, "Maxim S. Shatskih" <ma...@storagecraft.com.no.spam>
wrote:

(As this is topical primarily only in comp.lang.misc I have removed
a.l.a and a.o.d from followups. Feel free to add back if you
disagree.)

> > Python is not ONLY slow...
>
> From what I know on Python, it is a junk language worse than both Perl and PHP.

Any language has its good points and bad points. Whether Python is
good or bad probably depends on

1. What a programmer is trying to use it for. If we can't put on
wallpaper with a spanner does that mean a spanner is useless? (An
extreme example, I know.)

2. Whether the philosophy of the language matches the programmer's
approach or way of thinking.

As Brendan points out, a lot of people like Python. To me it's great
for doing some things. It is *much* more readable than Perl, as in the
following.

http://codewiki.wikispaces.com/ip_checksum.py

It is about the smallest example I've put online. As you can see, in
some respects it's very C-ish.
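for anyone not following the link: the routine there is essentially the
RFC 1071 Internet checksum (ones'-complement sum of 16-bit words). a minimal
Python sketch of that algorithm, not the wiki page's exact code:

```python
def ip_checksum(data: bytes) -> int:
    """RFC 1071: ones'-complement sum of 16-bit big-endian words."""
    if len(data) % 2:                 # pad odd-length input with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# IPv4 header with the checksum field zeroed; the result fills that field
hdr = bytes.fromhex("4500003c1c46400040060000ac100a63ac100a0c")
print(hex(ip_checksum(hdr)))          # 0xb1e6
```

verifying a received header works the same way: summing a header that
already contains its checksum yields 0.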

James

Rod Pemberton

unread,
Apr 10, 2010, 6:10:46 AM4/10/10
to
"Robert Redelmeier" <red...@ev1.net.invalid> wrote in message
news:hpnfaf$qa7$1...@speranza.aioe.org...
> In alt.lang.asm Nathan Baker <nathan...@gmail.com> wrote in part:
> > There was a time when, every few years, the CPU clock speed
> > seemed to be increasing geometrically...
> >
> > 1 MHz > 8 MHz > 25 MHz > 90 MHz > 133 MHz > 500 MHz
> >
> > But when they got into the GHz range, the trend stopped. There is
> > no longer a "user-noticeable" increase in performance gained by
> > purchasing new hardware. This places the burden on software.
>
> Even outside of niches (video compression is still CPU-bound), this
> really is not true. The trend stopped mostly because it didn't pay.
> CPU time has dwindled to insignificance for common tasks, but other
> things (video rendering, disk seek, network response) have not.
>

Yes, the Amiga PC model, using a processor with multiple coprocessors, was a
brilliant advancement, wasn't it? It's too bad PC's are still struggling to
adopt the model...

I was going to ask NB when multi-core x86 production started. Wasn't it
right where the clock-speed doubling stopped? i.e., around 1 GHz?


Rod Pemberton


Robert Redelmeier

unread,
Apr 10, 2010, 2:24:01 PM4/10/10
to
In alt.lang.asm Rod Pemberton <do_no...@havenone.cmm> wrote in part:

> "Robert Redelmeier" <red...@ev1.net.invalid> wrote in message
>> In alt.lang.asm Nathan Baker <nathan...@gmail.com> wrote in part:
>> > There was a time when, every few years, the CPU clock speed
>> > seemed to be increasing geometrically...
>> >
>> > 1 MHz > 8 MHz > 25 MHz > 90 MHz > 133 MHz > 500 MHz
>> >
>> > But when they got into the GHz range, the trend stopped. There is
>> > no longer a "user-noticeable" increase in performance gained by
>> > purchasing new hardware. This places the burden on software.
>>
>> Even outside of niches (video compression is still CPU-bound), this
>> really is not true. The trend stopped mostly because it didn't pay.
>> CPU time has dwindled to insignificance for common tasks, but other
>> things (video rendering, disk seek, network response) have not.
>>

> Yes, the Amiga PC model, using a processor with multiple
> coprocessors, was a brilliant advancement, wasn't it?
> It's too bad PC's are still struggling to adopt the model...

The original PC also had numerous support chips and a coprocessor
(the 8087), and Intel produced an 8089 I/O coprocessor, but I'm not
aware of it being used. Intel shied away from multiprocessing
(IIRC) for many years after the iAPX 432 failure.

> I was going to ask NB when multi-core x86 production started.
> Wasn't it right where the clock-speed doubling stopped? i.e.,
> around 1 GHz?

More like 2-3 GHz, but your point remains good. Higher clocks are
certainly possible on simpler cores (Pentium4) but that drops IPC.

Simplifying somewhat, the real problem is local heat -- a shrink
can run faster: speed increases linearly, but the area available
for heat removal decreases by the square. For many years,
voltage could also drop with process shrinks, which reduced
heat generation. But voltage can't drop much further, which stalled clocks.

So the additional xtors get used for multicore which spreads the
heat out. Moore's Law gives you xtors but says nothing about speed.
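the scaling argument above can be put in rough numbers. a back-of-envelope
sketch, assuming dynamic power P ~ C*V^2*f and classic Dennard scaling
(nothing vendor-specific):

```python
def power_density(s: float, voltage_scales: bool = True) -> float:
    """Relative power density after a process shrink by linear factor s (< 1)."""
    c = s                              # gate capacitance shrinks linearly
    v = s if voltage_scales else 1.0   # voltage historically tracked the shrink
    f = 1.0 / s                        # clock allowed to rise as gates shrink
    power = c * v * v * f              # P ~ C * V^2 * f
    area = s * s                       # heat-removal area drops by the square
    return power / area

print(power_density(0.7))                        # ~1.0  -- density holds steady
print(power_density(0.7, voltage_scales=False))  # ~2.04 -- heat concentrates
```

with voltage scaling, density stays flat and clocks can keep rising; once
voltage stops dropping, every shrink roughly doubles the local heat, so the
extra transistors go to more cores instead of more GHz.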

-- Robert

Maxim S. Shatskih

unread,
Apr 11, 2010, 3:24:22 PM4/11/10
to
> More like 2-3 GHz, but your point remains good. Higher clocks are
> certainly possible on simpler cores (Pentium4) but that drops IPC.

P4 was the most complex core ever made, that's why it was discontinued and
replaced with the desktop versions of Pentium-M (which was more-or-less an
advanced P-III Mobile) under the name "Core".

Robert Redelmeier

unread,
Apr 11, 2010, 3:54:11 PM4/11/10
to
In alt.lang.asm Maxim S. Shatskih <ma...@storagecraft.com.no.spam> wrote in part:

> P4 was the most complex core ever made, that's why it was discontinued
> and replaced with the desktop versions of Pentium-M (which was
> more-or-less advanced P-III Mobile) under the name "Core".

Which itself is basically a tweaked PentiumPro.

I grant you the P4 had lots of xtors for the deep pipelining,
but as a dual-issue was not as complex as triple-issue CPUs
(iPPro...iCore & K6...Phenom), at least from a clock-synch PoV.

P4 died from the poor IPC. As a stopgap after Itanium cratered,
I don't think it was expected to survive.


-- Robert

Phil Carmody

unread,
Apr 11, 2010, 6:11:04 PM4/11/10
to

It was an experiment with more extreme architectural elements
than its intel predecessors (pipeline depth, #ooo in flight,
double-pumped ALUs), but seems to have suffered from some
contradictory cost-cutting (ALUs being able to handle 4 IPC,
but only 3 could be physically fed to them per tick, sub-par
caches, etc.). I think they were hoping they could build on
it, as it went through several revisions, but it only clicked
later that it was a dead-end mess.

All IIRC, which, given how long ago it was, is unlikely.

Phil
--
I find the easiest thing to do is to k/f myself and just troll away
-- David Melville on r.a.s.f1
