"Markus Wichmann" <
null...@nospicedham.gmx.net> wrote in message
news:bitap8-...@voyager.wichi.de.vu...
> On 15.11.2011 14:11, Rod Pemberton wrote:
...
> > 00000000 14A5 adc al,0xa5
> > 00000002 80D3A5 adc bl,0xa5
> > 00000005 80D1A5 adc cl,0xa5
> > 00000008 80D2A5 adc dl,0xa5
> >
>
> When is it a problem that nasm generates _less_ code than MASM?
It's a problem when you need to have the exact byte sequence for one form to
be generated, but can't generate it.
E.g. #1, let's say you're recreating source for a program and it uses that
instruction, but it's not available in your assembler. What do you do?
Hardcode it? What if the instruction has an offset, and that offset
gets changed with changes to the source? Do you manually correct the
hardcoded instruction each time? Do you create a self-modifying patch?
E.g. #2, let's say you're testing an x86 interpreter. How do you use NASM
to generate the code to test the missing long form? What if the 014h form
is not implemented yet? So, you either hardcode the missing instruction,
binary edit a file to add it, or perhaps code a C program to emit it ...
Not having all forms of instruction encodings available complicates porting,
patching, and source code recreation.
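As a rough illustration of the "write a program to emit it" workaround, here is a minimal Python sketch that builds both forms of ADC with an 8-bit immediate. The helper names (modrm, adc_r8_imm8_long, adc_al_imm8_short) are made up for this example; the byte values follow the 14 ib and 80 /2 ib encodings shown in the listing quoted above.

```python
# Register numbers for the 8-bit registers, as used in the ModRM fields.
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}

def modrm(mod, reg, rm):
    """Pack the 2/3/3-bit ModRM fields into one byte."""
    return (mod << 6) | (reg << 3) | rm

def adc_r8_imm8_long(reg, imm):
    """Hypothetical helper: 'adc r8, imm8' in the general 80 /2 ib form."""
    # mod=11 selects a register operand; /2 in the reg field selects ADC.
    return bytes([0x80, modrm(0b11, 2, REG8[reg]), imm & 0xFF])

def adc_al_imm8_short(imm):
    """Hypothetical helper: the accumulator-only short form, 14 ib."""
    return bytes([0x14, imm & 0xFF])

print(adc_al_imm8_short(0xA5).hex())        # 14a5
print(adc_r8_imm8_long("al", 0xA5).hex())   # 80d0a5
print(adc_r8_imm8_long("bl", 0xA5).hex())   # 80d3a5
```

This only covers one instruction, which is the point: every missing form needs its own hand-made emitter, and any offset inside the instruction has to be kept in sync by hand.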
I am of the opinion, as are many others who program in assembly, that I
should be able to generate all forms of an instruction. As you become more
involved with programming in assembly, there will come a time and place
when you need that functionality.
> AFAIK
> the short forms execute faster. Plus you get more code into the cache
> that way. If that form of the instruction destroys some form of
> alignment, you can easily enforce this with the "align" macro (which by
> default insert's 90h's)
>
True.
> > NASM will always generate the short form (14h) for AL. How do you tell
> > NASM to generate the 80h form for AL? AFAIK, you can't. It's possible
> > it's been added to NASM. I haven't read newer NASM manuals.
> >
>
> Well, if you do know the exact machine code you want to have, you might
> as well tell NASM about it:
>
> db 80h, 0d0h, 0a5h
>
That's exactly what is not wanted. We were discussing unimplemented corner
cases. That's an unimplemented corner case.
> However, assemblers were originally created so programmers
> wouldn't have to fiddle with the specifics of machine code. That's why
> assemblers do have some freedom when choosing their encoding.
How do you choose exactly? That's the point. Without assembler syntax for
it, you can't. You can only hardcode it, as above with db ...
I, and probably many others, have run into "boundary cases". NASM is far
better in this regard for x86, but not complete. You can get NASM to
generate long forms of instructions that GAS and WASM won't. I'm not sure
about other assemblers.
> > If you ever get around to creating assembly code for some binary, you'll
> > likely attempt to check it by comparing the resulting binaries.
>
> I really don't know of any way this could possibly disturb you, _ever_.
> Especially, I don't know why I would want to write an assembler source
> file that generates the _exact_ same output as some existing file.
To ensure that the disassembled source for one assembler is equivalent to
the original source of another assembler, you want to check that the new
binary is identical to the original. If it's not, you'll have to check each
difference by disassembling. There could be thousands of them. Also, if
some instruction assembles to a form using a larger offset, then all code
and data that comes after is off by a byte or more, and then the recreated
source won't work. You want to check as few differences as possible. You
don't want to be checking numerous different but equivalent "reg, reg"
encodings. You'd like to check only the encodings which could break the
program, i.e., those with differently sized offsets, such as jumps and
branches. But, if you've got many "reg, reg" encodings to search through,
you can't easily locate wrongly sized branches and jumps. Also, if you can't
select different sizes of instruction encodings, how do you fix it? Kludge ...
Hardcode ... Self-modifying code ... etc.
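The comparison step can be sketched in a few lines of Python. This is only an illustration under assumptions: diff_runs is a made-up name, and the two byte strings are a tiny invented example (mov ax,bx in its two equivalent encodings, followed by identical code), not bytes from any real program.

```python
def diff_runs(original, rebuilt):
    """Return (offset, length) for each run of consecutive differing bytes."""
    if len(original) != len(rebuilt):
        # A size change means every later offset has shifted as well.
        raise ValueError("image sizes differ; an instruction probably changed length")
    runs, start = [], None
    for i, (a, b) in enumerate(zip(original, rebuilt)):
        if a != b and start is None:
            start = i                          # a new run of differences begins
        elif a == b and start is not None:
            runs.append((start, i - start))    # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(original) - start))
    return runs

orig = bytes.fromhex("89d8" "14a5" "c3")   # mov ax,bx ; adc al,0xa5 ; ret
new = bytes.fromhex("8bc3" "14a5" "c3")    # same program, other mov encoding
print(diff_runs(orig, new))                # [(0, 2)]
```

Each reported run still has to be disassembled and judged by hand, which is why a few hundred harmless "reg, reg" differences can bury the one wrongly sized jump.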
> In my book, "getting around to creating assembly code" happens only
> for few reasons; and trying to recreate an existing binary is not one of
> them.
Sometimes, programs need to be converted to another assembler, e.g., the
original assembler no longer exists or is hard to obtain. How do you
recompile? Let's say you've got a binary that can be disassembled, but you
need to be able to verify the recreated source is correct ... That means
comparing the resulting binary with the original. If you have a Public
Domain program as a binary but no source, e.g., a DOS device driver, what do
you do? Let's say you need to update that program, or convert or re-use a
part of it.
> > If the
> > binary was created by MASM and you're using NASM for the recreation,
> > you'll come across branches that NASM encodes one way and MASM
> > encodes another way. I.e., look at Jcc for multiple encodings.
> >
>
> Looking at the manuals, I don't see much ambiguity. Or do you mean:
>
> 70 cb JO rel8off
> 0F 80 cw JO rel16off
> 0F 80 cd JO rel32off
>
> So you can encode a jump within 128 bytes in three ways?
Only one of those encodes a jump exclusively within 128 bytes ... But, yes,
the 0F 80 form can encode a short branch.
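To make that concrete, here is a small Python sketch that encodes the same JO to the same target in both the rel8 and rel32 forms. The function names and the addresses 0x100 and 0x110 are invented for the example; the displacement is taken relative to the end of the instruction, and the rel32 form assumes a 32-bit operand size.

```python
import struct

def jo_short(addr, target):
    """70 cb: JO with an 8-bit displacement (2 bytes total)."""
    rel = target - (addr + 2)                # relative to the next instruction
    # struct.pack raises struct.error if rel does not fit in a signed byte.
    return bytes([0x70]) + struct.pack("<b", rel)

def jo_near32(addr, target):
    """0F 80 cd: JO with a 32-bit displacement (6 bytes total)."""
    rel = target - (addr + 6)
    return bytes([0x0F, 0x80]) + struct.pack("<i", rel)

print(jo_short(0x100, 0x110).hex())    # 700e
print(jo_near32(0x100, 0x110).hex())   # 0f800a000000
```

Both byte sequences jump to the same place, but the second is four bytes longer, so everything assembled after it lands at a different offset.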
> NASM always chooses the smallest possible one. I don't
> see what's wrong with that behaviour.
Nothing is "wrong" with that behavior. All assemblers have a "default"
behavior.
Fortunately, NASM allows you to select between those forms using NEAR and
o16/o32. NASM doesn't allow you to select on the earlier AL or AX/EAX
forms.
E.g., one piece of assembly I converted required NASM's SHORT on a couple of
JMPs, BYTE on all the arithmetic, and WORD on many MOVs and all the
remaining JMPs. NASM allows you to choose those corner cases. It would've
been much harder if one had to db all of them. That same x86 code has stuff
like x86 instructions as labels: CLD, CMOVE. You can argue CMOVE wasn't
around at the time, but not CLD. So, that assembler gave branch names
preference over instructions, while NASM doesn't.
> > Get a hex editor. Pick an instruction that can have multiple equivalent
> > encodings (above). Manually construct the instructions as you believe
> > they are encoded. Many x86 instructions use 8-bits in octal as 2/3/3
> > for encoding registers and the memory mode. Disassemble.
>
> You do know that assemblers were invented so people wouldn't have to
> bother with machine code, right?
I first programmed in assembly in the early '80s for 6502 with an assembler
which was one step beyond a text editor ...
> Anyway, do you want to tell us the long way that there are
> different ways to encode the same instruction and you think
> NASM chooses the wrong ones?
That's not what I said.
James was asking about problems with corner cases with NASM, especially LGDT
and LIDT. There are instructions where NASM can't generate all forms. It's
fairly well known that NASM emits some instructions differently from MASM,
which causes problems with porting.
E.g., NASM will emit the first hex byte for these instructions, whereas some
other assembler (e.g., perhaps MASM, A86, or ASM86) emits the second:
  or   09 vs. 0B
  xor  31 vs. 33
  cmp  39 vs. 3B
  cmp  38 vs. 3A
  mov  89 vs. 8B
  mov  88 vs. 8A
  sub  28 vs. 2B
etc.
Those are probably all "reg, reg" variations, both 8-bit and 16/32-bit.
For that program, there are just over 150 binary differences, half of which
are instructions, in just 6KB. In a normal program, the quantity of
differences could be huge.
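For the register-to-register case, the two encodings differ in the direction bit of the opcode (bit 1) and in which register sits in the ModRM reg field versus the r/m field. The following Python sketch shows this for MOV with 16-bit registers. All names here are made up for illustration, and same_reg_reg is only a heuristic: it assumes it is handed two-byte instructions from an opcode family that actually has a direction bit, with no prefixes.

```python
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def modrm(mod, reg, rm):
    """Pack the 2/3/3-bit ModRM fields into one byte."""
    return (mod << 6) | (reg << 3) | rm

def mov_r16_r16(dst, src):
    """Return both encodings of 'mov dst, src' for 16-bit registers."""
    store_form = bytes([0x89, modrm(0b11, REG16[src], REG16[dst])])  # MOV r/m16, r16
    load_form = bytes([0x8B, modrm(0b11, REG16[dst], REG16[src])])   # MOV r16, r/m16
    return store_form, load_form

def same_reg_reg(a, b):
    """True if two 2-byte reg,reg encodings differ only by operand direction."""
    if len(a) != 2 or len(b) != 2 or (a[0] ^ b[0]) != 0x02:
        return False
    if (a[1] >> 6) != 0b11 or (b[1] >> 6) != 0b11:
        return False                      # not register-to-register
    # Swapping the reg and r/m fields must turn one ModRM byte into the other.
    swapped = 0xC0 | ((a[1] & 7) << 3) | ((a[1] >> 3) & 7)
    return swapped == b[1]

store, load = mov_r16_r16("ax", "bx")
print(store.hex(), load.hex())       # 89d8 8bc3
print(same_reg_reg(store, load))     # True
```

A check like this could be used to filter harmless differences out of a binary comparison, but only after the instruction boundaries are already known.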
> > You may want other disassemblers besides NASM's NDISASM to
> > cross-check the disassembly. While NASM does a good job for
> > assembly, NDISASM doesn't (or didn't ...) disassemble everything
> > correctly.
>
> How come that again? I've yet to see a file ndisasm can't handle. Ugly
> hacks, like jumps into the middle of an instruction it can't do,
> granted, but it can do anything else.
...
> It even has an "intelligent sync" feature, allowing it to correctly decode
> an instruction/data mix. However, for that to work, the address of an
> instruction start has to occur as part of an instruction before the
> address itself. If that doesn't happen, there's also "manual sync".
It sounds like you're discussing Rosasm or the ancient x86 Bubble
disassembler ...
Which versions of NDISASM support this?
Anyway, "intelligent sync" as you described it isn't guaranteed to work.
All that means is some found bytes were equivalent to an offset or address
within the address range you've chosen to disassemble. x86 has a complete
single-byte instruction map, so nearly any byte can look like the start of
an instruction. So, there is no way to correctly sync via programmatic
methods. Even with full program analysis of all possible branches and
possible entry points, it's still possible that programmatic methods won't
locate the correct disassembly, i.e., multiple instruction paths are found.
When lots of data or text is added to the mix, or code that is not used by
the main program, or the program is short, the problem becomes more
difficult. If there is a code size switch, e.g., 32-bit following 16-bit,
it is "impossible" to determine programmatically. The x86 16-bit and 32-bit
encodings have a huge amount of overlap, but are slightly different. That
difference may be detectable, but it wouldn't be easy, which is why I put
"impossible" in quotes. For 64-bit mode, you could probably do a frequency
analysis on the code for a REX prefix to guess that it's 64-bit code.
Sequences of ASCII text can be guessed if bit 7 is clear for a run of bytes
in a code sequence. Anyway, the safe method is for the programmer to
determine the code entry points, e.g., by determining text data via a dump,
entry location via an object header, etc. If one entry point is known to be
correct, you can sync for a while ...
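The two guesses mentioned above (ASCII runs and REX prefix frequency) can be sketched in Python as follows. This is a rough sketch under assumptions: the function names, the minimum run length of 6, and the sample bytes are all invented for the example, and the text check is narrowed from "bit 7 clear" to printable ASCII (0x20 to 0x7E). Neither function proves anything; bytes 0x40 to 0x4F are also INC/DEC in 16/32-bit code and ordinary letters in text.

```python
def ascii_runs(data, min_len=6):
    """Return (offset, length) of printable-ASCII runs at least min_len long."""
    runs, start = [], None
    for i, b in enumerate(data):
        printable = 0x20 <= b <= 0x7E
        if printable and start is None:
            start = i
        elif not printable and start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs

def rex_fraction(data):
    """Fraction of bytes in 0x40..0x4F, the REX prefix range in 64-bit mode."""
    if not data:
        return 0.0
    return sum(0x40 <= b <= 0x4F for b in data) / len(data)

# mov ah,9 ; int 21h ; ret, followed by a DOS-style '$'-terminated string
blob = bytes.fromhex("b409cd21c3") + b"Hello, world$\x00"
print(ascii_runs(blob))                        # [(5, 13)]
print(rex_fraction(bytes.fromhex("4889d8")))   # one byte in three is in the REX range
```

At best this marks regions worth a closer look in a hex dump; the entry points still have to come from the programmer or from an object header.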
Rod Pemberton