How compilers implement objects in x86 assembly

daniel...@nospicedham.gmail.com

unread,

May 30, 2012, 7:45:31 PM5/30/12

to

Hi, I'm learning x86 assembly on my free time and I'm trying to find how does high level compilers implement classes and objects in x86 asm. So far I saw some toy compilers that use structs, but the way I understand it, structs are abstraction feature provided by assemblers like MASM e FASM. How would they implement classes and objects without structs? Arrays and offsets?

Rugxulo

unread,

May 30, 2012, 11:16:26 PM5/30/12

to

Hi,

Surely there are others more knowledgeable than I, but I'll chime in
anyways.

Function pointers. ;-) Well, at least this seems to be what gives
Modula-2 its status as "object-based" (though not OOP, by default).
Just stuff a few of those in a struct and you're halfway there, heh.

Oberon is the successor to Modula-2, and it had its own native
simplified OOP (later extended a bit with Oberon-2). You can read more
about that on Wikipedia or various sites, so I won't try to summarize
too much here.

Basically, Wirth said this:

= a class is a record type (struct) with both data and function
pointers
= an object is a variable (instance of aforementioned record)
= a method is just a procedure that operates on data in the object

Of course it gets more complicated due to trying to do various things
(virtual method table?). The OOP paradigm seems to mostly focus on
encapsulation, polymorphism, and inheritance. So you're basically
keeping track of some things behind the scenes (in lieu of explicit
type discriminant [?] of variant records) that are tested either
manually or "preferably" (?) at runtime.

Okay, so none of that was a great explanation, esp. since I barely
understand it myself, but you can download and read Dr. Mossenboch's
_OOP in O2_ book (very exhaustive) here:

http://ssw.jku.at/Research/Books/

BGB

unread,

May 31, 2012, 1:35:11 AM5/31/12

to

On 5/30/2012 6:45 PM, daniel...@nospicedham.gmail.com wrote:
> Hi, I'm learning x86 assembly on my free time and I'm trying to find how does high level compilers implement classes and objects in x86 asm. So far I saw some toy compilers that use structs, but the way I understand it, structs are abstraction feature provided by assemblers like MASM e FASM. How would they implement classes and objects without structs? Arrays and offsets?

typically, the operation will encode the offset of any fields being
accessed, which will often be hard coded in the output (a struct's
fields are basically just a set of offsets from some base-pointer anyways).

typically, the data will be organized similarly to a struct, and the
methods will be accessed via a vtable, which is basically an array of
function pointers (a call will encode the index into this table).

in my VM, it is a little different:
compiled code does not directly access objects, but will instead
generate references to handlers which will be generated at link time,
and which will deal with the matter of object layout.

currently, objects have headers which link to both the data and the
(current) class definition. this is partly because object layout is not
fixed in my VM, and there may exist multiple versions of a given class
(with differing physical layouts), at any given moment. the memory
layout itself resembles a structure though.

or such...

George Neuner

unread,

May 31, 2012, 3:41:22 PM5/31/12

to

A struct[ure] (also called a record) is simply a group of data offsets
related by a common base pointer. The code generator (or programmer)
obviously needs to know the layout, types and sizes of the fields, but
references to the fields then are simply pointer+offset.

The trick with objects is how to implement dynamic dispatch to
functions based on the object's type. There are 2 general ways to do
this: generic functions with parametric dispatch or member functions.
Imperitive languages tend to use member functions. FP languages tend
to prefer generic functions, although some use a blend of generic and
member functions.

Member Functions:

The very simplest way to implement member functions is to add to the
data struct a pointer field for each function. However, this isn't
very satisfactory because every object of the same "class" (i.e. the
same type) will need to have pointers to all the same functions.
Redundantly putting pointers into each object needlessly wastes space.

So the first optimization is to pull the common function pointers out
into a separate dispatch table, and to place a pointer to the dispatch
table into each object. This is the "class pointer" referred to in
the literature.

How best to implement the dispatch "table" depends on the type of
inheritance (none, single or multiple) and also on whether the
inheritance hierarchy is fixed or can change at run time. I'm not
going to discuss either multiple inheritance or how to modify the
hierarchy at run time ... those are topics for next semester after you
master single inheritance.

For single inheritance with the class hierarchy fixed at run time, the
simplest thing to do is to act like there is no inheritance and give
each class a complete table of its functions, even if they were
inherited from a superclass. More flexible, but also more involved,
is to keep the individual class dispatch tables separate and with each
dispatch table keep a pointer to the immediate superclass. The
difficulty using separate tables is having to follow the chain of
superclass pointers at run time to find the correct dispatch table. To
do this the code generator has to know the "distance" in the hierarchy
between the current class and the superclass containing the target
function definition.
[A speed optimization for separate tables is to use a "display" of
superclass pointers which exposes the entire root hierarchy instead of
a single pointer to the immediate superclass. A display permits
reaching any of the superclasses with a single pointer dereference.
However, you still need to know the "distance" to the function
definition to determine which of the display pointers to follow.]

Full function tables are the way to go if the OO implementation is
"classless" ... that is, there are OO languages which are "template"
based: you create a new object by cloning an existing one, and create
a new type of object by adding/deleting data fields and methods. WRT
class based OO, template OO doesn't necessarily require any additional
work at run time: all the "created-from" type relationships still can
be determined at compile time and so the layouts of the objects and
dispatch tables can be fixed.

Generic Functions:

Generic functions are groups overloaded functions (same name, unique
parameter types) dispatched on the *runtime* types of the arguments
(i.e. *not* like C++ overloads which dispatch on static types).

The general way of implementing them is to generate the code for each
typed overload, then to create an dispatch function which determines
the runtime types of its arguments, matches the call (if possible) to
the most specific of the typed overloads, and then executes the
overload passing through the arguments [ideally this should be a jump
to the overload code rather than an additional level of call nesting].

The runtime type of an object can be known by embedding a numeric
"class" id field in the structure (analogous to the class pointer in
the member function implementation). The id then is used as an index
into a (sparse) multi-dimensional table with one dimension for each
polymorphic parameter of the generic function. For each unique set
of parameter types, the address of the correct overload code to
execute is stored at the intersection of the dimension type indices.

The dispatch table only needs to consider *distinguishing* parameters
... i.e. if the nth parameter of every overload is a FOO, then the nth
argument is no help deciding which overload to execute [however the
dispatch code still needs to check that the nth argument of the call
actually *is* a FOO].

There are a couple "gotcha's" to watch out for. The set of overloads
may not cover the hierarchy: e.g., a FOO argument may have to be
matched to an overload which expects a superclass of FOO. So you
either need to represent the entire hierarchy in the dispatch table,
or you need an ordering of type ids such that you can know levels have
been skipped and the superclass id is the correct match. You also
need a fall back position for when the call arguments can't be matched
to any existing overload.

It may seem like generic functions are a lot more work, but in fact
the 2 methods for dynamic dispatch are essentially equivalent in
complexity ... I haven't talked about the difficulties of implementing
multiple inheritance with member functions. However, it should be
obvious that generic dispatch on a single parameter is (or at least
can be) of the same complexity as single inheritance member function
dispatch.

If you really are interested in this topic, you should get yourself a
book on compiler implementation (don't just look at examples).

FWIW: OO programming is a royal PITA when the language has no support
for it. IMO it's easier to write a compiler that generates assembler
for a high(er) level OO language than it is to program the OO
constructs directly in assembler. YMMV.

Hope this doesn't (further) muddy the water.
George

hopcode

unread,

May 31, 2012, 9:27:01 PM5/31/12

to

Il 31.05.2012 01:45, daniel...@nospicedham.gmail.com ha scritto:
> ...trying to find how does high level compilers implement classes and objects in x86 asm.

essential

http://www.openrce.org/articles/full_view/23

Cheers,

--
.:mrk[hopcode]
.:x64lab:.
group http://groups.google.com/group/x64lab
site http://sites.google.com/site/x64lab

io_x

unread,

Jun 1, 2012, 4:04:37 AM6/1/12

to

"George Neuner" <gneu...@nospicedham.comcast.net> ha scritto nel messaggio
news:ni5fs7h7dep3454if...@4ax.com...

> On Wed, 30 May 2012 16:45:31 -0700 (PDT),
> daniel...@nospicedham.gmail.com wrote:
>
>>Hi, I'm learning x86 assembly on my free time and I'm trying to find how
>>does high level compilers implement classes and objects in x86 asm. So
>>far I saw some toy compilers that use structs, but the way I understand
>>it, structs are abstraction feature provided by assemblers like MASM e
>>FASM. How would they implement classes and objects without structs?
>>Arrays and offsets?
>
> A struct[ure] (also called a record) is simply a group of data offsets
> related by a common base pointer. The code generator (or programmer)
> obviously needs to know the layout, types and sizes of the fields, but
> references to the fields then are simply pointer+offset.
>
> The trick with objects is how to implement dynamic dispatch to
> functions based on the object's type. There are 2 general ways to do
> this: generic functions with parametric dispatch or member functions.
> Imperitive languages tend to use member functions. FP languages tend
> to prefer generic functions, although some use a blend of generic and
> member functions.
>
>
> Member Functions:
>
> The very simplest way to implement member functions is to add to the
> data struct a pointer field for each function. However, this isn't
> very satisfactory because every object of the same "class" (i.e. the
> same type) will need to have pointers to all the same functions.
> Redundantly putting pointers into each object needlessly wastes space.

i prefer that space... i don't know vtables etc, i'm not too much smart...

the problem is always the same:
how one can goes down from OO [from C or C++] until the last
statement "mov [eax], ecx" easily...
or perhaps the easy operation *a=c;

problem is how to scale well
problem is the asm or the low level language result
is easy to read

if you don't scale language well
or the asm result is too much complex
done from machine to machines
for correct one error you can miss some weeks
instead of few minutes...

Buon Giorno

Daniel Zazula

unread,

Jun 1, 2012, 5:56:29 PM6/1/12

to

Hi, it's me again (the original poster), my name is Daniel and I'm using (new version of) google groups to post to this group, but, for some reason, my name don't appear in the original post.
Anyways, thank you all for the answers. I'm going to digest all that information for now.

George Neuner

unread,

Jun 2, 2012, 4:28:41 PM6/2/12

to

On Fri, 1 Jun 2012 10:04:37 +0200, "io_x" <a...@b.c.invalid> wrote:

>the problem is always the same:
>how one can goes down from OO [from C or C++] until the last
>statement "mov [eax], ecx" easily...
>or perhaps the easy operation *a=c;

Objects *are* structures, so data field access is just pointer
displacement addressing. The exciting part is implementing function
dispatch which depends on the runtime type of the object.

>problem is how to scale well

Don't follow.

>problem is the asm or the low level language result
>is easy to read

Assembler only very rarely is "easy to read" ... by any definition.

Regardless of code comments and/or documentation, IME (going on 22
years with 10 spent in high performance and hard real time systems)
any non-trivial algorithm expressed in assembler is a mess that no one
but its developer can easily understand. And after a few months away
often even the original developer will have trouble following it.

>if you don't scale language well
>or the asm result is too much complex
>done from machine to machines
>for correct one error you can miss some weeks
>instead of few minutes...

That's why OO code is compiler generated rather than implemented by
hand.

George

io_x

unread,

Jun 3, 2012, 5:30:19 AM6/3/12

to

"George Neuner" <gneu...@nospicedham.comcast.net> ha scritto nel messaggio

news:idrks7lg0njr6cpg1...@4ax.com...

> On Fri, 1 Jun 2012 10:04:37 +0200, "io_x" <a...@b.c.invalid> wrote:
>
>>the problem is always the same:
>>how one can goes down from OO [from C or C++] until the last
>>statement "mov [eax], ecx" easily...
>>or perhaps the easy operation *a=c;
>
> Objects *are* structures, so data field access is just pointer
> displacement addressing. The exciting part is implementing function
> dispatch which depends on the runtime type of the object.
>
>>problem is how to scale well
>
> Don't follow.
>
>>problem is the asm or the low level language result
>>is easy to read
>
> Assembler only very rarely is "easy to read" ... by any definition.

if you can not know how to traslate-recognize the complex operation in
more easy operation at kind of "*a=c" "*a|=c" "b&=r" etc
so in pratice assembly operations, you miss the way to check
if something is right or wrong...
you don't know what you are doing...
you can not explain one complex operation in term of words
there is need math-variables and easy operation on them

> Regardless of code comments and/or documentation, IME (going on 22
> years with 10 spent in high performance and hard real time systems)
> any non-trivial algorithm expressed in assembler is a mess that no one
> but its developer can easily understand. And after a few months away
> often even the original developer will have trouble following it.

i'm not agree, i find easier to find bug or understand in the macro
asm language i use better than C or C++
because the language is easier and has no UB...
because in asm i check for error more,
sometime even what is impossible to happen...

yes there is code a little more complex but if write it down
in some other language
it will be more complex because the algo is complex...

>>if you don't scale language well
>>or the asm result is too much complex
>>done from machine to machines
>>for correct one error you can miss some weeks
>>instead of few minutes...
>
> That's why OO code is compiler generated rather than implemented by
> hand.

it is not easy...
but i'm only a begginer, i seen only few...so i could make wrong

> George

George Neuner

unread,

Jun 4, 2012, 7:22:42 PM6/4/12

to

On Sun, 3 Jun 2012 11:30:19 +0200, "io_x" <a...@b.c.invalid> wrote:

>
>"George Neuner" <gneu...@nospicedham.comcast.net> ha scritto nel messaggio
>news:idrks7lg0njr6cpg1...@4ax.com...

>> Assembler only very rarely is "easy to read" ... by any definition.
>
>if you can not know how to traslate-recognize the complex operation in
>more easy operation at kind of "*a=c" "*a|=c" "b&=r" etc
>so in pratice assembly operations, you miss the way to check
>if something is right or wrong...
>you don't know what you are doing...
>you can not explain one complex operation in term of words
>there is need math-variables and easy operation on them

Recognizing a particular high level construct expressed in assembler
may be relatively easy (depends on the ISA). However, recognizing an
*algorithm* which is composed of many such constructs is extremely
difficult (in any ISA).

Looking at a compiler's output to verify correct operation is not
representative of the problem. You need to take a non-trivial piece
of code written by someone else and try to understand it.

Better yet, try to do it without comments. Anti-virus research is a
good example ... it can takes weeks (or even months) to figure out the
operation of a new specimen of malware.

Consider an array assignment in a high level language: A[i][j] = b

Even with simple scalar values and without array bounds checking, the
assignment expression above may require a dozen or more lines of
assembler depending on the ISA and the semantics of the high level
language. With bounds checking added, the assignment above may
require 20 or more lines of assembler. Moreover, the code for the
assignment may be discontiguous due to instruction or register
scheduling, and due to common expression hoisting by the compiler.

>> Regardless of code comments and/or documentation, IME (going on 22
>> years with 10 spent in high performance and hard real time systems)
>> any non-trivial algorithm expressed in assembler is a mess that no one
>> but its developer can easily understand. And after a few months away
>> often even the original developer will have trouble following it.
>
>i'm not agree,

That's your privilege. 8-)

> i find easier to find bug or understand in the macro
>asm language i use better than C or C++
>because the language is easier and has no UB...

What makes you think assembler has no undefined behavior?

>because in asm i check for error more,
>sometime even what is impossible to happen...

You don't check for these same errors in high level code?

>yes there is code a little more complex but if write it down
>in some other language
>it will be more complex because the algo is complex...

No. High level languages exist precisely because they can express
certain ideas more easily than can lower level languages. A single
page of Lisp code may be equivalent to 50 (or even more) pages of
assembler ... which do you honestly think will be easier to read and
understand?

Language expressive power has no relationship to language equivalence
power. An algorithm has the same order of complexity in any language
capable of expressing it. However, expression of the same algorithm
in different languages can appear to be more or less complicated.

Lisp, C and x86 assembler all are Turing complete and so anything that
can be expressed in one of them can be expressed in any of them. But
the number of lines of code required to write equivalent programs in
each of these languages differs by orders of magnitude.

George

io_x

unread,

Jun 5, 2012, 5:12:05 AM6/5/12

to

"George Neuner" <gneu...@nospicedham.comcast.net> ha scritto nel messaggio

news:ob7qs7l2m7ohebnfu...@4ax.com...

> On Sun, 3 Jun 2012 11:30:19 +0200, "io_x" <a...@b.c.invalid> wrote:
>
>>
>>"George Neuner" <gneu...@nospicedham.comcast.net> ha scritto nel messaggio
>>news:idrks7lg0njr6cpg1...@4ax.com...
>
>
>>> Assembler only very rarely is "easy to read" ... by any definition.
>>
>>if you can not know how to traslate-recognize the complex operation in
>>more easy operation at kind of "*a=c" "*a|=c" "b&=r" etc
>>so in pratice assembly operations, you miss the way to check
>>if something is right or wrong...
>>you don't know what you are doing...
>>you can not explain one complex operation in term of words
>>there is need math-variables and easy operation on them
>
> Recognizing a particular high level construct expressed in assembler
> may be relatively easy (depends on the ISA). However, recognizing an
> *algorithm* which is composed of many such constructs is extremely
> difficult (in any ISA).

an hll can be just a set of assembly function that do complex operations;
so if the name of these hll function is good chose,
if someone know the comment of these asm functions, what they do,
if these functions scale well the problem: all it is clear
from functions() to the last *a=c inside these functions

> Looking at a compiler's output to verify correct operation is not
> representative of the problem. You need to take a non-trivial piece
> of code written by someone else and try to understand it.
>
> Better yet, try to do it without comments. Anti-virus research is a
> good example ... it can takes weeks (or even months) to figure out the
> operation of a new specimen of malware.

in that, they have not the name of the functions, not their comment
that say what these functions do

> Consider an array assignment in a high level language: A[i][j] = b
>
> Even with simple scalar values and without array bounds checking, the
> assignment expression above may require a dozen or more lines of
> assembler depending on the ISA and the semantics of the high level
> language. With bounds checking added, the assignment above may
> require 20 or more lines of assembler. Moreover, the code for the
> assignment may be discontiguous due to instruction or register
> scheduling, and due to common expression hoisting by the compiler.
>
>>> Regardless of code comments and/or documentation, IME (going on 22
>>> years with 10 spent in high performance and hard real time systems)
>>> any non-trivial algorithm expressed in assembler is a mess that no one
>>> but its developer can easily understand. And after a few months away
>>> often even the original developer will have trouble following it.
>>
>>i'm not agree,
>
> That's your privilege. 8-)
>
>> i find easier to find bug or understand in the macro
>>asm language i use better than C or C++
>>because the language is easier and has no UB...
>
> What makes you think assembler has no undefined behavior?

there is UB, in instructions i use, if for example "mov [eax], ebx"
or "mov ebx, [eax]" and eax is not a valid address
it would be seg-falut and this is ok too
this is the only one type of UB i can find...

>>because in asm i check for error more,
>>sometime even what is impossible to happen...
>
> You don't check for these same errors in high level code?

no, overflow indices, overflow pointers, overflow unsigned
see if carry flag is set
afther some function call and jump on error section
it is difficult on hlls
it require more txt space than what i use in macro-asm
and more txt space there is, more difficult is to follow
the path of the program

>>yes there is code a little more complex but if write it down
>>in some other language
>>it will be more complex because the algo is complex...
>
> No. High level languages exist precisely because they can express
> certain ideas more easily than can lower level languages. A single
> page of Lisp code may be equivalent to 50 (or even more) pages of
> assembler ... which do you honestly think will be easier to read and
> understand?
>
> Language expressive power has no relationship to language equivalence
> power. An algorithm has the same order of complexity in any language
> capable of expressing it. However, expression of the same algorithm
> in different languages can appear to be more or less complicated.
>
> Lisp, C and x86 assembler all are Turing complete and so anything that
> can be expressed in one of them can be expressed in any of them. But
> the number of lines of code required to write equivalent programs in
> each of these languages differs by orders of magnitude.

i not agree, my macro-asm language is better of C or C++ for
make operation and algo on arrays; all the remain hll functions
and C++ operator too
can be done using functions implement algos on arrays...

but i possibly see wrong on all; do you have to consider i not
wrote any big program nor one significative algo ...

> George