
Summary: translation units, preprocessing, compiling and linking?


Steven T. Hatton

Jun 2, 2004, 3:58:30 PM
Is there anything that gives a good description of how source code is
converted into a translation unit, then object code, and then linked. I'm
particularly interested in understanding why putting normal functions in
header files results in multiple definition errors even when include guards
are used.
--
STH
Hatton's Law: "There is only One inviolable Law"
KDevelop: http://www.kdevelop.org SuSE: http://www.suse.com
Mozilla: http://www.mozilla.org

Victor Bazarov

Jun 2, 2004, 4:26:57 PM
Steven T. Hatton wrote:
> Is there anything that gives a good description of how source code is
> converted into a translation unit, then object code, and then linked.

If that's a question, the answer is "probably, I don't know of any,
though".

> I'm
> particularly interested in understanding why putting normal functions in
> header files results in multiple definition errors even when include guards
> are used.

Each translation unit is converted into one object file (usually). If
a header that is included in more than one translation unit contains
the definition of a function, the compiler places the machine code for
that function's body into the object file of every translation unit that
includes that header. Include guards only help against multiple
inclusion of the same header while compiling _the_same_ translation unit.

You simply need to understand that inclusion is like merging texts. Put
the text of the header in question inside each module where you say
"#include <thatheader>" and try compiling it in your mind.

--------------------------- a.h
#ifndef A_DOT_H
#define A_DOT_H
int foo() { return 42; } // definition
#endif
--------------------------- b.h
#ifndef B_DOT_H
#define B_DOT_H
#include "a.h"
extern "C" char blahblah(); // declaration
#endif
--------------------------- ab.cc
#include "a.h"
#include "b.h"
int main() {
return foo();
}
--------------------------- ba.cc
#include "b.h"
#include "a.h"

extern "C" char blahblah() { return 'Z'; } // definition
------------------------------------------------------------

When the compiler gets its hands on 'ab.cc' what does it see? It
sees the result of preprocessing the file. Here it is before it
was tokenized:
------------------------------ ab.cc.preprocessed
#ifndef A_DOT_H
#define A_DOT_H . here A_DOT_H is defined
int foo() { return 42; }
#endif
#ifndef B_DOT_H
#define B_DOT_H
#ifndef A_DOT_H . This part is ignored
#define A_DOT_H | because 'A_DOT_H' has been
int foo() { return 42; } | already defined above -- the
#endif ` role of include guards
extern "C" char blahblah();
#endif
int main() {
return foo();
}
-----------------------------------------------------
As you can see, 'ab.cc' when compiled will have one definition of 'foo'
and one definition of 'main' (a declaration of 'blahblah' does not make it
to the object file).

Now, let's "preprocess" 'ba.cc'.
---------------------------------- ba.cc.preprocessed
#ifndef B_DOT_H
#define B_DOT_H
#ifndef A_DOT_H
#define A_DOT_H . here A_DOT_H is defined
int foo() { return 42; } // definition
#endif
extern "C" char blahblah(); // declaration
#endif
#ifndef A_DOT_H . This part is ignored
#define A_DOT_H | because A_DOT_H has been
int foo() { return 42; } // definition | defined just before -- again
#endif ` include guards at work
extern "C" char blahblah() { return 'Z'; } // definition
-----------------------------------------------------------------
As you can see, 'ba.cc' when compiled will have one definition of 'foo'
and one definition of 'blahblah'.

Now, when you link those two files together, how many definitions of
'foo' will there be? Which one should be used when called from 'main'?
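If you actually try to build it (a sketch using GCC; the exact error text
varies by linker), the collision shows up at the very last step:

    g++ -c ab.cc       produces ab.o, with a definition of 'foo'
    g++ -c ba.cc       produces ba.o, with another definition of 'foo'
    g++ ab.o ba.o      the linker stops with something like
                       "multiple definition of `foo()'"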

For more information on how object files are created, try a book about
your compiler; it may have more details. Besides, creation of object code
and linking are truly compiler-specific issues, not governed by the
language specification.

Victor

Mark A. Gibbs

Jun 2, 2004, 4:32:13 PM

Steven T. Hatton wrote:

> Is there anything that gives a good description of how source code is
> converted into a translation unit, then object code, and then linked. I'm
> particularly interested in understanding why putting normal functions in
> header files results in multiple definition errors even when include guards
> are used.

a translation unit is a single source file, with all of its includes.
given a project consisting of:

a.hpp
b.hpp
c.hpp // includes a.hpp
d.hpp // includes c.hpp and b.hpp
w.cpp
x.cpp // includes a.hpp
y.cpp // includes a.hpp and d.hpp
z.cpp // includes a.hpp, b.hpp, c.hpp and d.hpp

the files actually passed to the compiler are w, x, y and z. when the
compiler inspects each file:

translation unit w (becomes w.o or w.obj):
consists only of w.cpp

translation unit x (becomes x.o or x.obj):
consists of x.cpp and a.hpp

translation unit y (becomes y.o or y.obj):
consists of y.cpp and a.hpp and d.hpp
d.hpp consists of c.hpp and b.hpp
c.hpp consists of a.hpp(2)
therefore: consists of y.cpp, a.hpp, d.hpp, c.hpp, b.hpp, a.hpp(2)
the redundant copy of a.hpp is handled by include guards
therefore: consists of y.cpp, a.hpp, d.hpp, c.hpp, b.hpp

translation unit z (becomes z.o or z.obj):
consists of z.cpp, a.hpp, b.hpp, c.hpp and d.hpp
c.hpp consists of a.hpp(2)
d.hpp consists of c.hpp(2) and b.hpp(2)
c.hpp(2) consists of a.hpp(3)
therefore: consists of z.cpp, a.hpp, b.hpp, c.hpp, d.hpp, a.hpp(2),
c.hpp(2), b.hpp(2) and a.hpp(3)
the redundant copies are handled by include guards
therefore: consists of z.cpp, a.hpp, d.hpp, c.hpp, b.hpp

so far so good, your include guards have filtered out multiple
definitions. now comes link time. when w, x, y and z are linked (the .o
or .obj files that is), they have a total of 3 copies of any code
defined in a.hpp, and 2 copies of code in b.hpp, c.hpp and d.hpp.
therefore, multiple definition errors.

include guards are part of the preprocessor stage. they are used before
compiling, so multiple definitions are filtered out of your translation
units. once compiled into object code, however, your include guards are
gone. anything that got past the include guards to get compiled is in
the object code. so if multiple translation units have definitions of
the same source code object (function, variable, etc.), the code will be
in each translation unit's object code, and you'll get multiple
definition errors at link time.

the moral of the story is: do not put any definitions in headers. although
you can ensure that a single translation unit contains only one definition,
you cannot protect against multiple translation units each including, and
each ending up with, a definition of its own ((well you can, but that's
nasty stuff)).
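the usual way out is a sketch like this (file names made up): keep only a
declaration in the header, and give the definition exactly one home in a
source file:

--------------------------- monkey.hpp
#ifndef MONKEY_HPP
#define MONKEY_HPP
int monkey();                  // declaration only -- safe to include anywhere
#endif
--------------------------- monkey.cpp
#include "monkey.hpp"
int monkey() { return 42; }    // the one and only definition
---------------------------------------------------

every translation unit that includes monkey.hpp sees just the declaration;
the single definition ends up in monkey.o alone, so the linker meets it
exactly once.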

the language provides you with protection in the special case of
template code.
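for example (a sketch; the template must stay visible in the header, and the
language guarantees the duplicate instantiations are merged):

--------------------------- maximum.hpp
#ifndef MAXIMUM_HPP
#define MAXIMUM_HPP
// this definition may appear in every translation unit that
// includes the header; identical instantiations of maximum<T>
// are folded together instead of triggering duplicate errors
template <typename T>
T maximum(T a, T b) { return a < b ? b : a; }
#endif
---------------------------------------------------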

mark

Joe Laughlin

Jun 2, 2004, 4:46:48 PM
Steven T. Hatton wrote:
> Is there anything that gives a good description of how
> source code is converted into a translation unit, then
> object code, and then linked. I'm particularly
> interested in understanding why putting normal functions
> in header files results in multiple definition errors
> even when include guards are used.

Perhaps
http://www.cs.washington.edu/orgs/acm/tutorials/dev-in-unix/compiler.html is
what you are looking for... ?


Steven T. Hatton

Jun 2, 2004, 6:32:06 PM
Victor Bazarov wrote:

> Steven T. Hatton wrote:
>> I'm particularly interested in understanding why putting normal functions in
>> header files results in multiple definition errors even when include
>> guards are used.
>
> Now, when you link those two files together, how many definitions of
> 'foo' will there be? Which one should be used when called from 'main'?

This will explain my current understanding:
http://baldur.globalsymmetry.com/gs-home/images/random-noise/solution.html

I guess what you explained is more or less what I thought, but then I was
told the #pragma interface and #pragma implementation used by GCC prevented
the creation of an object for each translation unit. I can only guess that
what is happening is that the 'global' object is treated as if it appeared
in the object file of each translation unit. That would explain why I
still get the linker errors using the pragmas. I'll see what the GCC folks
say about this.

Victor Bazarov

Jun 2, 2004, 6:38:57 PM
Steven T. Hatton wrote:
> Victor Bazarov wrote:
>> Steven T. Hatton wrote:
>>> I'm particularly interested in understanding why putting normal functions in
>>> header files results in multiple definition errors even when include
>>> guards are used.
>>
>> Now, when you link those two files together, how many definitions of
>> 'foo' will there be? Which one should be used when called from 'main'?
>
> This will explain my current understanding:
> http://baldur.globalsymmetry.com/gs-home/images/random-noise/solution.html

:-)

> I guess what you explained is more or less what I thought, but then I was
> told the #pragma interface and #pragma implementation used by GCC prevented
> the creation of an object for each translation unit. I can only guess that
> what is happening is that the 'global' object is treated as if it appeared
> in the object file of each translation unit. That would explain why I
> still get the linker errors using the pragmas. I'll see what the GCC folks
> say about this.

That's a wise route to take. #pragmas and linking are compiler-specific
and can be (and usually are) different with every compiler. Some compilers
do make an effort to mimic another compiler's behavior, but that's not the
usual situation.

When it's up to me, I just never put an implementation in a header without
declaring it "inline". Whether it's actually inlined or not I don't care;
I let the compiler figure that out, but I never get multiple definitions
that way.
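For example, reusing the a.h from my earlier post (a sketch):

--------------------------- a.h
#ifndef A_DOT_H
#define A_DOT_H
inline int foo() { return 42; }   // 'inline' allows this definition to
                                  // appear in every translation unit
#endif
---------------------------------------------------

With that one change, ab.cc and ba.cc link together without complaint.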

Good luck figuring it all out!

V

E. Robert Tisdale

Jun 2, 2004, 7:05:33 PM
Steven T. Hatton wrote:

> Is there anything that gives a good description of how source code is
> converted into a translation unit, then object code, and then linked.
> I'm particularly interested in understanding why
> putting normal functions in header files
> results in multiple definition errors even when include guards are used.

When you invoke your C++ compiler,
the C preprocessor accepts a source file and emits a translation unit;
the C++ compiler proper accepts the translation unit
and emits assembler code;
the assembler accepts the assembler code and emits machine code;
and the link editor accepts the machine code,
loads it into an executable program file
along with the required objects from library archives,
and resolves all of the links.
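With GCC, for instance, you can stop after each stage and inspect the
intermediate file (a sketch; the flags are standard GCC options, the file
names are made up):

    g++ -E main.cc -o main.ii    preprocess only: headers merged, macros expanded
    g++ -S main.ii -o main.s     compile proper: emit assembler code
    g++ -c main.s  -o main.o     assemble: emit machine code (an object file)
    g++ main.o -o main           link: resolve references, pull in libraries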

When the C preprocessor reads your source file,
it includes the header files in the translation unit,
reads and processes all of the macros,
then discards them when it has finished.
It does *not* remember macro definitions
when it processes the next source file,
so, if that next source file includes the same header file,
the header file will be read again
and any external function definition in that header
will be included in the next translation unit as well.
The link editor will discover the multiple function definitions
when it tries to link the resulting machine code files together.

If, instead, you qualify the function definition as inline or static,
the compiler will label it as a "local" link
so the link editor will not complain.
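A sketch of the two qualifiers (the header and function names are made up;
note that they behave differently: "static" gives each translation unit
its own private copy, while "inline" relies on the one definition rule's
allowance for repeated identical definitions, which are folded into one):

--------------------------- util.h
#ifndef UTIL_DOT_H
#define UTIL_DOT_H

/* internal linkage: not visible to the link editor;
   every translation unit gets a private copy */
static int twice(int n) { return 2 * n; }

/* external linkage, but repeated identical definitions
   are permitted and merged across translation units */
inline int thrice(int n) { return 3 * n; }

#endif
---------------------------------------------------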

JKop

Jun 3, 2004, 7:47:34 AM
Steven T. Hatton posted:

> Is there anything that gives a good description of how source code is
> converted into a translation unit, then object code, and then linked.
> I'm particularly interested in understanding why putting normal
> functions in header files results in multiple definition errors even
> when include guards are used.

I highly suggest that this be added to the FAQ.


You have Source Code files:


a.cpp
b.cpp
s.cpp

You have 1 Header file:

b.hpp

Both a.cpp and s.cpp include b.hpp.


The three source code files get compiled into object files:

a.obj b.obj s.obj

And they're passed on to the linker.

Now suppose b.hpp defines a function, Monkey. The linker sees Monkey in
a.obj AND in s.obj, hence a multiple definition error.
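A minimal sketch of that situation (Monkey's body is invented for
illustration):

--------------------------- b.hpp
#ifndef B_HPP
#define B_HPP
int Monkey() { return 1; }   // definition in a header
#endif
--------------------------- a.cpp
#include "b.hpp"
--------------------------- s.cpp
#include "b.hpp"
int main() { return Monkey(); }
---------------------------------------------------

The include guard protects each .cpp on its own, but a.obj and s.obj each
end up carrying machine code for Monkey.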

So how do you get away with putting inline functions into a header file?
The one definition rule makes a special allowance for them: each translation
unit may carry its own identical copy of the definition, and the copies are
merged rather than reported as duplicates. (Declaring a function "static"
also avoids the error, by giving it internal linkage, ie. it isn't presented
to the linker at all, but then each translation unit gets its own private
copy; "static" is not implied by inline in C++.)


-JKop

Karl Heinz Buchegger

Jun 3, 2004, 8:25:09 AM
"Steven T. Hatton" wrote:
>
> Is there anything that gives a good description of how source code is
> converted into a translation unit, then object code, and then linked. I'm
> particularly interested in understanding why putting normal functions in
> header files results in multiple definition errors even when include guards
> are used.

This is what I posted some time ago. It may be of some help
to you.

--------

First of all let me introduce a few terms and clarify
their meaning:

source code file     The files which contain C or C++
                     code in the form of functions and/or
                     class definitions.

header file          Another form of source file. Header files
                     are usually used to separate the 'interface'
                     description from the actual implementation,
                     which resides in the source code files.

object code file     The result of feeding a source code file through
                     the compiler. Object code files already contain
                     machine code, the one and only language your computer
                     understands. Nevertheless object code at this stage
                     is not executable. One object code file is the direct
                     translation of one source code file and thus usually
                     lacks external references, e.g. the actual implementation
                     of functions which are defined in other source code files.

library file         A collection of object code files. It happens frequently
                     that a set of object code files is always used together.
                     Instead of listing all those object code files during the
                     link process every time, it is often possible to build a
                     library from them and use the library instead. But there
                     is no magic in a library. A library can be seen as a
                     repository where one can deposit object code files, such
                     that the library forms a collection of them. (A sketch of
                     how such a library is built follows this list.)

compiling            The process of transforming source code files into
                     object code files. C and C++ define the concept of a
                     'translation unit'. Each translation unit (normally: one
                     single source code file) is translated independently of
                     all other translation units.

linking              The process of combining multiple object code files and
                     libraries into an executable. During the linking process
                     all external references of one object code file are
                     examined and the linker tries to find modules which
                     satisfy those external references.
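On a Unix-like system, for example, such a library is built with the
archiver (a sketch; 'ar' is the standard tool, the file names are made up):

    gcc -c test.c               compile: produces test.o
    ar rcs libtest.a test.o     deposit test.o into the library libtest.a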


In practice the whole process works as follows:
Say you have 2 source files (with errors, we will return to them later)

main.c
******

int main()
{
foo();
}

test.c
******

void foo()
{
printf( "test\n" );
}

and you want to create an executable. The steps are
as in the graphics:


main.c test.c
+----------------+ +-----------------------+
| | | |
| int main() | | void foo() |
| { | | { |
| foo(); | | printf( "test\n" ); |
| } | | } |
+----------------+ +-----------------------+
| |
| |
v v
********** **********
* Compiler * * Compiler *
********** **********
| |
| |
| |
main.obj v test.obj v
+--------------+ +--------------+
| machine code | | machine code |
+--------------+ +--------------+
| |
| |
+------------------+ +--------------------+
| |
v v
************* Standard Library
* Linker *<----------+--------------------+
************* | eg. implementation |
| | of printf or the |
| | math functions |
| | |
| +--------------------+
main.exe v
+-------------------------+
| Executable which can |
| be run on a particular |
| operating system |
+-------------------------+


So the steps are: compile each translation unit (each source file) independently
and then link the resulting object code files to form the executable. To do that,
missing functions (like printf or sqrt) are added by linking in a prebuilt library
which contains the object modules for them.

The important part is:
Each translation unit is compiled independently! So when the compiler compiles
test.c it has no knowledge of what happened in main.c and vice versa. When the
compiler tries to compile main.c it eventually reaches the line
foo();
where main.c tries to call function foo(). But the compiler has never heard of
a function foo! Even if you have compiled test.c prior to it, when main.c is
compiled this knowledge is already lost. Thus you have to inform the compiler
that foo() is not a typing error and that there indeed is somewhere a function
called foo. You do this with a function prototype:


main.c
+----------------+
| void foo(); |
| |
| int main() |
| { |
| foo(); |
| } |
+----------------+
|
|
v
**********
* Compiler *
**********
|

Now the compiler knows about this function and can do its job. In very much the same way
the compiler has never heard of a function called printf(). printf is not part of
the 'core' language. In a conforming C implementation it has to exist somewhere, but
printf() is not on the same level as 'int' is. The compiler knows about 'int' and
what it means, but printf is just a function call and the compiler has to know its
parameters and return type in order to compile a call to it. Thus you have to inform
the compiler of its existence. You could do this in very much the same way as you
did it in main.c, by writing a prototype. But since this is needed so often and
there are so many other functions available, this gets boring and error-prone very
fast. Thus somebody else has already provided all those prototypes in a separate
file, called a header file, and instead of writing the prototypes yourself, you
simply 'pull in' this header file and have them all available:


test.c
+-----------------------+
| #include <stdio.h> |<-+
| | |
| void foo() | |
| { | |
| printf( "test\n" ); | |
| } | |
+-----------------------+ |
| |
| |
v |
********** stdio.h v
* Compiler * +-------------------------------------+
********** | ... |
| | int printf( const char* fmt, ... ); |
| ... |
+-------------------------------------+

And now the compiler has everything it needs to know to compile test.c.
Since main.c and test.c could be compiled successfully, they can be linked
into the final executable, which can then be run. During the process of linking,
the linker figures out that there is a call to foo() in main.obj. Thus the linker
tries to find a function called foo. It finds this function by searching through
the object module test.obj. The linker then inserts the correct memory address
for foo into main.obj and also includes foo from test.obj in the final executable.
But in doing so, the linker also figures out that in function foo() there is a
call to printf. The linker thus searches for a function printf and finds it in
the standard library, which is always searched when linking a C program, so
printf is included in the final executable too. printf() by itself may use other
functions to do its work, but the linker will find all of them in the standard
library and include them in the final executable as well.

There is one thing left to talk about. While main.c is correct from a technical
point of view, it is still unsatisfying. Imagine that our function foo() has
a much more complicated argument list. Also imagine that your program does not
consist of just those 2 translation units but instead has hundreds of them, and
that foo() needs to be called in 87 of them. Thus you would have to write a prototype
in every single one of them. I think I don't have to tell you what that means: all those
prototypes need to be correct, and in case function foo() changes (things like
that happen), all those 87 prototypes need to be updated. So how can you avoid that?
You already know the solution; you have used it already. You do pretty much
the same as you did in the case of stdio.h. You write a header file and
include this instead of the prototype:

main.c
+-------------------+ test.h
| #include "test.h" |<---------+-------------+
| | | void foo(); |
| int main() | | |
| { | +-------------+
| foo(); |
| } |
+-------------------+
|
|
v
**********
* Compiler *
**********
|

Now you can include that header file in all the 87 translation units which
need to know about foo(). And if the prototype for foo() needs updating,
you do it in one central place: by editing the file test.h. All 87 translation
units will pull in the updated prototype when they are recompiled.
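One refinement, since this thread started with include guards: in practice
test.h would also get a guard, so that it survives being included twice in
the same translation unit (a sketch):

test.h
******

#ifndef TEST_H
#define TEST_H

void foo();   /* prototype only -- the definition stays in test.c */

#endif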


HTH

--
Karl Heinz Buchegger
kbuc...@gascad.at

Steven T. Hatton

Jun 3, 2004, 9:07:35 AM
E. Robert Tisdale wrote:

> Steven T. Hatton wrote:
>
>> Is there anything that gives a good description of how source code is
>> converted into a translation unit, then object code, and then linked.
>> I'm particularly interested in understanding why
>> putting normal functions in header files
>> results in multiple definition errors even when include guards are used.
>
> If, instead, you qualify the function definition as inline or static,
> the compiler will label it as a "local" link
> so the link editor will not complain.

I think anonymous namespaces will act in a similar way. It seems there are
many ways to bang your thumb when working with C++ #includes. Oh, and then
there's the question of how the compiler knows something is a C++ source
file. If there is no distinction between source and header, then what
tells it that a particular file is a header? As it turns out:


"Compilation can involve up to four stages: preprocessing, compilation
proper, assembly and linking, always in that order. The first three stages
apply to an individual source file, and end by producing an object file;
linking combines all the object files (those newly compiled, and those
specified as input) into an executable file.

"For any given input file, the file name suffix determines what kind of
compilation is done:

file.c
C source code which must be preprocessed.

file.i
C source code which should not be preprocessed.

file.ii
C++ source code which should not be preprocessed.

file.m
Objective-C source code. Note that you must link with the
library libobjc.a to make an Objective-C program work.

file.mi
Objective-C source code which should not be preprocessed.

file.h
C header file (not to be compiled or linked).
file.cc
file.cp
file.cxx
file.cpp
file.c++
file.C
C++ source code which must be preprocessed. Note that in .cxx, the last two
letters must both be literally x. Likewise, .C refers to a literal capital
C."

So I type in 'gcc -ofoo main.cc' and get a bunch of errors, then type `g++
-ofoo main.cc' and the same code compiles. Go figure! (Presumably the gcc
driver compiles the .cc file as C++ but doesn't link in the C++ standard
library, while g++ does.)

Howard

Jun 3, 2004, 2:01:15 PM
"Steven T. Hatton" <susu...@setidava.kushan.aa> wrote in message
news:8o-dnSG807b...@speakeasy.net...

This looks like you've quoted a particular implementation's documentation,
not the standard. If I recall correctly, there is nothing in the standard
that specifies that a file have a particular extension, or any extension at
all, for that matter. As far as I know, you can call your main source file
"Bob", and have it include a "header" file called "Carol". What makes
something a source file is if you instruct your compiler to compile it.
What makes it a header file is if you include it from one or more source
files, but don't directly compile it. (Actually, there is no such thing in
the standard as a "header file", if I recall. The term is used to specify
an include file that contains the declarations for the classes and/or
functions implemented in the source file.) Exactly how you instruct your
compiler to compile a file is up to the compiler vendor. Obviously, some
vendors chose to recognize specific file extensions as compilable (source)
files. (Possibly to make it easier to compile all source files in a given
directory?) Likewise, extensions like .a and .o are totally arbitrary,
although they tend to follow common practice.

-Howard

"All programmers write perfect code.
...All other programmers write crap."

"I'm never wrong.
I thought I was wrong once,
but I was mistaken."

