[gogc] 2 new revisions pushed by bytbox on 2010-08-18 01:07 GMT

1 view

Skip to first unread message

go...@googlecode.com

unread,

Aug 18, 2010, 12:16:12 AM8/18/10

to go...@googlegroups.com

2 new revisions:

Revision: 9784736374
Author: Michael Lippautz <michael....@gmail.com>
Date: Thu Jul 1 13:07:49 2010
Log: documentation: Add first parser paragraph
http://code.google.com/p/gogc/source/detail?r=9784736374

Revision: 381c907edd
Author: Andreas Unterweger <and...@gmx.at>
Date: Sat Jul 3 02:04:37 2010
Log: docs: Function calls
http://code.google.com/p/gogc/source/detail?r=381c907edd

==============================================================================
Revision: 9784736374
Author: Michael Lippautz <michael....@gmail.com>
Date: Thu Jul 1 13:07:49 2010
Log: documentation: Add first parser paragraph
http://code.google.com/p/gogc/source/detail?r=9784736374

Modified:
/docs/latex/gogo.pdf
/docs/latex/gogo.tex

=======================================
--- /docs/latex/gogo.pdf Thu Jul 1 12:39:03 2010
+++ /docs/latex/gogo.pdf Thu Jul 1 13:07:49 2010
Binary file, no diff available.
=======================================
--- /docs/latex/gogo.tex Thu Jul 1 12:39:03 2010
+++ /docs/latex/gogo.tex Thu Jul 1 13:07:49 2010
@@ -44,6 +44,7 @@
\end{enumerate}

\section{EBNF}
+ \label{sec:ebnf}
Lorem ipsum dolor sit amet...

\subsection*{Atoms}
@@ -158,9 +159,9 @@
way are called simple tokens, as they can be generated right away. For
instance, a sequence \textit{'A'} can be directly converted into a
token
representing a byte value. \\
- Before providing a token to the parser, the scanner may convert such a
- simple token one more time. These complex tokens are generated from
simple
- ones that represent identifiers. The identifiers are compared to a
+ Before providing a token to the parser, the scanner may convert such
+ simple tokens one more time. These complex tokens are generated from
simple
+ ones that represent \texttt{identifiers}. \texttt{Identifiers} are
compared to a
predefined list of keywords. If a keyword matches a token value, the
token
is converted to the one representing the keyword. Table
\ref{tbl:tokens}
lists some of these tokens.
@@ -169,13 +170,25 @@
\centering
\begin{tabular}{ll}
\toprule
- Simple tokens & \&\&, +, -, \{, \}, $($, $)$, \dots \\
- Complex tokens & for, if, else, func, type \\
+ Simple tokens & \texttt{\&\&}, \texttt{+}, \texttt{-},
\texttt{\{}, \texttt{\}}, \texttt{$($}, \texttt{$)$}, \dots \\
+ Complex tokens & \texttt{for}, \texttt{if}, \texttt{else},
\texttt{func}, \texttt{type} \\
\bottomrule
\end{tabular}
\caption{Token examples}
\label{tbl:tokens}
\end{table}
+
+ The scanner also implements a very simple escaping mechanism that
allows
+ sequences like \textit{'\textbackslash n'} to be used in strings. \\
+ Comments can be written as \textit{/* ... */} blocks or
+ \textit{\textbackslash \textbackslash} till line ending, like in C/C++.
+ \\ \\
+ The parser then takes these tokens and represents the language defined
by
+ the EBNF from section \ref{sec:ebnf}. The parser is basically
implemented
+ as LL1 like in \cite{wir96}, with one minor difference. In order to be
+ compatible with Go it was necesarry to include one namespace hierachy
that
+ is represented by packages. Since this namescope is prefixed
+ using \texttt{package.} the parser needs a lookup of three in this
case.

\chapter{Symbol table}
In order to be able to lookup local and global variable names as well
as function names, a symbol table is required. Based on \cite{wir96},
object and type descriptors were used, each containing the information
required for lookup and code generation. Object descriptors are used to
store information about variables and parameters whereas type descriptors
are used to store information about types and functions.\\
@@ -275,19 +288,27 @@
The addition, substraction and multiplication operations are also
used for offset calculations. Thus, an additional distinction per
\texttt{Item} is required in order to be able to distinguish between
addresses and values stored in registers. As arithmetical operations on
\texttt{byte} and \texttt{uint64} types always operate on a value, the
actual value to be calculated with has to be loaded prior calculation. As
offset calculations always require the address to be loaded into the
register instead of the value, it has to be made sure that the address is
loaded, not the value. In order to distinguish between addresses and values
in registers, the \texttt{A} field of the \texttt{Item} structure is used.
Additionally, the code generation routines for addition and subtraction
have an additional parameter specifying whether to calculate with addresses
or values, issuing the necessary dereferencing operations if required.

\section{The generation of assignments}
- As pointer types are supported, type checks in assignments as well
as the assignments themselves get harder to implement as additional cases
have to be dealt with. Additionally, the possible occurrence of the address
operator (\texttt{\&}) on the right hand side of an assignment doubles the
number of cases. The following table illustrates the distinctions made and
the code generated for some of the cases allowed by the EBNF (* denotes
pointer types, LHS and RHS are the \texttt{Items} on the left and right
hand side, respectively). For the sake of clearity, only the cases with a
non-pointer type variable \texttt{Item} on the left hand side and no
address operator on the right hand side using \texttt{uint64} types are
shown. The compiler is also able to assign \texttt{byte} values to one
another as well as to \texttt{uint64} types.\\ \\
- \begin{tabular}{|l|l|l|}
- \hline
- \textbf{LHS type} & \textbf{RHS type} & \textbf{Code/Error
generated}\\ \hline
- Variable & Constant & \texttt{MOVQ \$RHS.A, (LHS.A)}\\ \hline
- Variable & Constant* & Type error\\ \hline
- Variable & Variable & \texttt{MOVQ (RHS.A), R\textit{temp} | MOVQ
R\textit{temp}, (LHS.A)}\\ \hline
- Variable & Variable* & Type error\\ \hline
- Variable & Register with value & \texttt{MOVQ RHS.R, (LHS.A)}\\
\hline
- Variable & Register with value* & Type error\\ \hline
- Variable & Register with address & \texttt{MOVQ (RHS.R), RHS.R |
MOVQ RHS.R, (LHS.A)}\\ \hline
- Variable & Register with address* & Type error\\ \hline
- \end{tabular}\\ \\
+ As pointer types are supported, type checks in assignments as well
as the assignments themselves get harder to implement as additional cases
have to be dealt with. Additionally, the possible occurrence of the address
operator (\texttt{\&}) on the right hand side of an assignment doubles the
number of cases. The following table \ref{tbl:assigntypes} illustrates the
distinctions made and the code generated for some of the cases allowed by
the EBNF (* denotes pointer types, LHS and RHS are the \texttt{Items} on
the left and right hand side, respectively). For the sake of clearity, only
the cases with a non-pointer type variable \texttt{Item} on the left hand
side and no address operator on the right hand side using \texttt{uint64}
types are shown. The compiler is also able to assign \texttt{byte} values
to one another as well as to \texttt{uint64} types.
+
+ \begin{table}[htb]
+ \begin{tabular}{lll}
+ \toprule
+ \textbf{LHS type} & \textbf{RHS type} & \textbf{Code/Error
generated}\\
+ \midrule
+ Variable & Constant & \texttt{MOVQ \$RHS.A, (LHS.A)}\\
+ Variable & Constant* & Type error\\
+ Variable & Variable & \texttt{MOVQ (RHS.A), R\textit{temp} | MOVQ
R\textit{temp}, (LHS.A)}\\
+ Variable & Variable* & Type error\\
+ Variable & Register with value & \texttt{MOVQ RHS.R, (LHS.A)}\\
+ Variable & Register with value* & Type error\\
+ Variable & Register with address & \texttt{MOVQ (RHS.R), RHS.R |
MOVQ RHS.R, (LHS.A)}\\
+ Variable & Register with address* & Type error\\
+ \bottomrule
+ \end{tabular}
+ \caption{Assignment types}
+ \label{tbl:assigntypes}
+ \end{table}
+
String assignments are handled separately as they require 16 bytes
to be assigned. As one register can only hold 8 bytes, a second register
needs to be allocated in order to perform the assignment. Although this may
not be necessary in trivial cases where one string variable is assigned to
another, strings in structures which require offset calculations force the
use of a register when performing the offset calculation and therefore
require a second register to dereference the value of the other 8 bytes. In
order to be able to do this, the \texttt{C} field of the \texttt{Item}
structure is being used as it is not needed for other purposes outside
conditionals.

\section{The generation of conditional expressions}
@@ -312,17 +333,25 @@

\section{I/O syscalls}
Besides read and write operations to files (and the console),
exiting the program as well as opening and closing files requires the use
of syscalls. On Linux 64 bit operating systems with Intel architecture,
these syscalls can be invoked by the assembly mnemonic \texttt{SYSCALL}
where the register \texttt{RAX} contains the syscall number defining the
syscall, and the registers \texttt{RDI}, \texttt{RSI} and \texttt{RDX}
contain the first, second and third parameter respectively\cite{var08}.\\
- The following table lists the syscalls used by libgogo, together
with the value of \texttt{RAX} representing the syscall number. The latter
were derived from the constants defined in
\texttt{/usr/src/linux-headers-2.6.32-22/arch/x86/include/asm/unistd\_64.h}
of the current Linux kernel source\cite{var10}. The syscall function
prototypes (for semantics and formal parameters) were derived from the
corresponding Linux man pages (see \cite{var97} and others).\\ \\
- \begin{tabular}{|c|c|c|}
- \hline
- \textbf{Syscall number (\texttt{RAX})} & \textbf{Syscall function}
& \textbf{Purpose}\\ \hline
- 0 & \texttt{sys\_read} & Reads from a file\\ \hline
- 1 & \texttt{sys\_write} & Writes to a file\\ \hline
- 2 & \texttt{sys\_open} & Opens a file\\ \hline
- 3 & \texttt{sys\_close} & Closes a file\\ \hline
- 12 & \texttt{sys\_brk} & See next section\\ \hline
- 60 & \texttt{sys\_exit} & Exits the program\\ \hline
+ The following table \ref{tbl:syscalls} lists the syscalls used by
libgogo, together with the value of \texttt{RAX} representing the syscall
number. The latter were derived from the C constants defined in
\texttt{/usr/src/linux-headers-2.6.32-22/arch/x86/include/asm/unistd\_64.h}
of the current Linux kernel source\cite{var10}. The syscall function
prototypes (for semantics and formal parameters) were derived from the
corresponding Linux man pages (see \cite{var97} and others).
+
+ \begin{table}[htb]
+ \centering
+ \begin{tabular}{ccc}
+ \toprule
+ \textbf{Syscall number (\texttt{RAX})} & \textbf{Syscall function}
& \textbf{Purpose}\\
+ \midrule
+ 0 & \texttt{sys\_read} & Reads from a file\\
+ 1 & \texttt{sys\_write} & Writes to a file\\
+ 2 & \texttt{sys\_open} & Opens a file\\
+ 3 & \texttt{sys\_close} & Closes a file\\
+ 12 & \texttt{sys\_brk} & See next section\\
+ 60 & \texttt{sys\_exit} & Exits the program\\
+ \bottomrule
\end{tabular}
+ \caption{Syscalls}
+ \label{tbl:syscalls}
+ \end{table}

\section{The memory manager}
Libgogo provides a very simple memory manager using a bump pointer
which can allocate, but not free memory. By using the
\texttt{sys\_brk}\cite{var97} function, the memory manager expands the data
segment of the running program in steps of 10 KB if necessary in order to
deal with subsequent allocations.\\

==============================================================================
Revision: 381c907edd
Author: Andreas Unterweger <and...@gmx.at>
Date: Sat Jul 3 02:04:37 2010
Log: docs: Function calls
http://code.google.com/p/gogc/source/detail?r=381c907edd

Modified:
/docs/latex/gogo.pdf
/docs/latex/gogo.tex

=======================================
--- /docs/latex/gogo.pdf Thu Jul 1 13:07:49 2010
+++ /docs/latex/gogo.pdf Sat Jul 3 02:04:37 2010
Binary file, no diff available.
=======================================
--- /docs/latex/gogo.tex Thu Jul 1 13:07:49 2010
+++ /docs/latex/gogo.tex Sat Jul 3 02:04:37 2010
@@ -30,6 +30,7 @@
\tableofcontents

\chapter{Introduction}
+ GoGo is a self-compiling Go compiler written in Go which implements
scanning, parsing and code generation for a subset of the Go
language\cite{goo10}. It is mostly compatible with the Go compiler
available in \cite{goo10} and outputs valid Plan9 assembly code which can
be assembled by the latter as well as the Plan9 assembly tools \texttt{6a}
and \texttt{6l}\cite{pik00}.\\
Lorem ipsum dolor sit amet...

\chapter{Input Language}
@@ -317,8 +318,51 @@
\section{The generation of loops}
Lorem ipsum dolor sit amet...

- \section{The generation of functions} \label{The generation of
functions}
- Lorem ipsum dolor sit amet...
+ \section{The generation of functions calls} \label{The generation of
functions}
+ Consider the following example with the current function having 3
local variables -- 2 of type \texttt{uint64} and one of type
\texttt{string} -- calling the function \texttt{foo} which has the
following prototype:
+ \begin{lstlisting}
+func foo(param1 uint64, param2 uint64, param3 string) uint64;
+ \end{lstlisting}
+ In order to be compatible to the Go calling convention, the stack is
organized as follows before calling \texttt{foo}:
+ \begin{table}[h!]
+ \begin{tabular}{ll}
+ \toprule
+ \textbf{Address} & \textbf{Content}\\
+ \midrule
+ \texttt{SP-0} & Saved \texttt{IP}\\
+ \texttt{SP-8} & Local variable 1 (uint64)\\
+ \texttt{SP-16} & Local variable 2 (uint64)\\
+ \texttt{SP-24} & Local variable 3 (string), higher 8 bytes\\
+ \texttt{SP-32} & Local variable 3 (string), lower 8 bytes\\
+ \midrule
+ \texttt{SP-40} & Placeholder for return value of foo\\
+ \texttt{SP-48} & \texttt{param3} of \texttt {foo}, higher 8 bytes\\
+ \texttt{SP-56} & \texttt{param3} of \texttt {foo}, lower 8 bytes\\
+ \texttt{SP-64} & \texttt{param2} of \texttt {foo}\\
+ \texttt{SP-72} & \texttt{param1} of \texttt {foo}\\
+ \bottomrule
+ \end{tabular}
+ \end{table}\\ \\
+ As can be seen, the parameters are pushed in reverse order and share
the address calculation algorithm of local variables (considering the
offsets of the local variables), allowing to reuse the latter for this
purpose. In order to move the parameters to their corresponding positions
on the stack, the offsets are calculated and an assignment statement with
the stack offset on the LHS (as variable \texttt{Item}) is being performed
which allows to reuse the code for assignment generation, including all
type checks and conversions. Optionally, currently used registers are
pushed onto the stack between the local variables and the return value,
yielding an offset correction to be considered when pushing the
parameters.\\
+ When calling \texttt{foo} in the above example, the \texttt{SP} is
first decreased by 72, followed by another implicit decrease of 8 due to
the \texttt{CALL} instruction which pushes the current \texttt{IP} onto the
stack and decreases the \texttt{SP} automatically. The stack is then
prepared for \texttt{foo} to which it appears similar as to the callee,
with the old \texttt{IP} at offset \texttt{SP}, the parameters on positive
offsets relative to \texttt{SP} and the local variables (if any) with
negative offsets.
+ \begin{table}[h!]
+ \begin{tabular}{lll}
+ \toprule
+ \textbf{Address after call} & \textbf{Address before call} &
\textbf{Content}\\
+ \midrule
+ \texttt{SP+40} & \texttt{SP-40} & Placeholder for return value of
foo\\
+ \texttt{SP+32} & \texttt{SP-48} & \texttt{param3} of \texttt
{foo}, higher 8 bytes\\
+ \texttt{SP+24} & \texttt{SP-56} & \texttt{param3} of \texttt
{foo}, lower 8 bytes\\
+ \texttt{SP+16} & \texttt{SP-64} & \texttt{param2} of \texttt
{foo}\\
+ \texttt{SP+8} & \texttt{SP-72} & \texttt{param1} of \texttt {foo}\\
+ \texttt{SP} & -- & \texttt{IP} of callee\\
+ \texttt{SP-8} & -- & Possible local variable of \texttt{foo}\\
+ \bottomrule
+ \end{tabular}
+ \end{table}\\ \\
+ As can be seen, the offsets before and after the call differ by
\texttt{72+8=80} as explained above. GoGo does not use base or frame
pointers, but relies only on the stack pointer to perform all offset
calculations. The return value is always located "after" the other
parameters and can be used as the RHS of an assignment with its known
offset as explained with the parameters above. After the function call, 80
is added to the stack pointer in order to restore the previous stack "view"
for the callee. Optionally saved registered are restored considering their
additional offset in reverse order.\\
+ Internally, a function is represented by a \texttt{TypeDesc}
(compare table \ref{tbl:typedesc}) where the fields represent the functions
parmaeters, including an optional return value at their end. When a
function is being called without prior declaration, the parameter types
have to be derived from the types of the expressions encountered. As the
total number of parameters is unknown before the end function call in the
input file, a marker is inserted in the output code and comments to
indicate that the final offset has yet to be determined. As the offset due
to local variables is known, it is included. Due to the marker which
includes information about the name of the function called, the linker is
able to adapt the unfinished offsets in order to correct them accordingly.\\
+ When functions have been called, but not declared, and then called
again, the type check included in the code generation for assignments is
already effective and detects improper assignment types. However, it is
necessary to extend this check as the first call can p.e. be with a
parameter of type \texttt{byte} and the second call with a parameter of
type \texttt{uint64}. This is allowed, as a \texttt{byte} type represents a
subset of the \texttt{uint64} type, requiring a conversion for the former
in order to assure that the 7 additional bytes are zero. This can be done
by simply extending the type check of the assignment. This also applies to
the return value (as it is treated like a parameter), although it has to be
considered that calls of functions without return value assignments do not
necessarily mean that the function called has no return value. Therefore,
additional checks are necessary, including the check of the return value's
existence and type when the actual function declaration is being made.

\section{Global variable initialization}
Besides the compiled functions from the input file, the code
segment contains a function called main.init which performs the
initialization of global variables. In contrast to local variables which
can be initialized directly at point of their declaration, global variables
need to be initialized before any other methods are called, thus requiring
the main.init function.\\

Reply all

Reply to author

Forward

0 new messages