A compiler translates code written in one language into another language without changing the meaning of the program. It is also expected that a compiler makes the target code efficient and optimized in terms of time and space.
Compiler design principles provide an in-depth view of the translation and optimization process. Compiler design covers the basic translation mechanism along with error detection and recovery. It includes lexical, syntax, and semantic analysis as the front end, and code generation and optimization as the back end.
Computers are a balanced mix of software and hardware. Hardware is just an electronic device whose functions are controlled by compatible software. Hardware understands instructions in the form of electronic charge, which is the counterpart of binary language in software programming. Binary language has only two symbols, 0 and 1. To instruct the hardware, code must be written in binary format, which is simply a series of 1s and 0s. Writing such code directly would be a difficult and cumbersome task for programmers, which is why we have compilers to generate it from higher-level languages.
We have learnt that any computer system is made of hardware and software. The hardware understands a language that humans cannot. So we write programs in a high-level language, which is easier for us to understand and remember. These programs are then fed into a series of tools and OS components to produce the desired code that can be used by the machine. This is known as the Language Processing System.
This tutorial is designed for students interested in learning the basic principles of compilers. Enthusiastic readers who would like to know more about compilers, and those who wish to design a compiler themselves, may start from here.
This tutorial requires no prior knowledge of compiler design, but it does require a basic understanding of at least one programming language such as C or Java. Prior exposure to assembly programming would be an additional advantage.
Compiler Design is the process of creating software tools called compilers that translate human-readable code written in high-level programming languages, such as C++ or Java, into machine-readable code understood by computers, like assembly language or machine code. The goal of compiler design is to automate this translation process, making it more efficient and accurate. Compilers analyze the structure and syntax of the source code, perform various optimizations, and generate executable programs that can be run on computers.
We use compilers to convert human-readable code into machine-readable code so that computers can understand and execute it. Compilers streamline the translation process, making it faster and more efficient. They also enable programmers to write code in high-level languages, which are easier to understand and maintain. Additionally, compilers optimize the generated code for improved performance and portability across different computer systems.
The concept of a compiler was developed by Grace Hopper, an American computer scientist, in the 1950s. She created the first compiler, called the A-0 System, which translated mathematical notation into machine code. Hopper's invention revolutionized programming by allowing programmers to write code in human-readable languages rather than machine code, making software development faster and more accessible. Her pioneering work laid the foundation for modern compiler technology, which continues to be essential in computer programming today.
A compiler translates high-level programming code written by humans into machine-readable instructions that computers can understand and execute. It first analyzes the structure and syntax of the code to ensure correctness, then optimizes it for efficiency. Afterward, the compiler generates machine code, consisting of binary instructions tailored to the computer's architecture. This process automates the translation of complex code, making programming more accessible and efficient for developers while enabling computers to execute tasks accurately.
Learning to write a basic compiler can vary in time depending on factors like prior programming experience and the complexity of the compiler. It might take several months to a year or more to understand the concepts and develop the skills needed to create a basic compiler. This process involves learning about lexical analysis, parsing, code generation, and optimization techniques, as well as gaining proficiency in a programming language and understanding computer architecture. Practice, experimentation, and learning from resources like books, tutorials, and online courses can help speed up the learning process.
A syntax tree in compiler design is a hierarchical representation of the structure of source code written in a programming language. It visually organizes the elements of code, such as expressions, statements, and declarations, into a tree-like structure based on the grammar rules of the language. Each node in the tree represents a specific syntactic construct, while the edges between nodes indicate relationships between them, such as parent-child or sibling relationships.
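For a rough illustration, a syntax tree for the expression 2 + 3 * x could be built from simple node objects like the ones below; the node and field names are illustrative only, not taken from any particular compiler.

```python
# Minimal sketch of syntax tree nodes (names are illustrative).
class Node:
    def __init__(self, kind, value=None, children=None):
        self.kind = kind                # e.g., "BinaryOp", "Number", "Identifier"
        self.value = value              # operator symbol, literal value, or identifier name
        self.children = children or []  # child nodes (empty for leaves)

# Tree for the expression 2 + 3 * x, with * nested below + to reflect precedence.
tree = Node("BinaryOp", "+", [
    Node("Number", 2),
    Node("BinaryOp", "*", [
        Node("Number", 3),
        Node("Identifier", "x"),
    ]),
])
```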
A compiler is considered system software. System software is a type of software that provides essential functions for a computer system to operate, manage resources, and support the execution of other software applications. A compiler falls into this category because it is responsible for translating high-level programming code into machine-readable instructions that computers can execute. Without a compiler, programmers would not be able to create software applications.
A token in compiler design is a basic building block of source code, representing the smallest unit of meaningful information. Think of tokens as the individual words or symbols in a sentence. In programming languages, tokens can include keywords (like "if" or "while"), identifiers (like variable names), operators (like "+" or "-"), literals (like numbers or strings), and punctuation (like semicolons or parentheses). During the lexical analysis phase of compilation, the compiler breaks down the source code into tokens, which are then used to understand the structure and meaning of the program.
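For example, a single assignment statement such as count = count + 1; breaks down into a short sequence of tokens matching the categories above; the token type names below are illustrative, since every compiler chooses its own.

```python
# Tokens produced for the statement: count = count + 1;
# (token type names are illustrative)
tokens = [
    ("IDENTIFIER",  "count"),
    ("OPERATOR",    "="),
    ("IDENTIFIER",  "count"),
    ("OPERATOR",    "+"),
    ("LITERAL",     "1"),
    ("PUNCTUATION", ";"),
]
```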
Compiler architecture refers to the overall design and structure of a compiler. It consists of various components and stages involved in the compilation process, from analyzing the source code to generating machine-readable output. Compiler architecture generally includes modules for lexical analysis (breaking down code into tokens), syntax analysis (parsing the structure of the code), semantic analysis (checking for meaning and correctness), optimization (improving the efficiency of the code), and code generation (producing machine code).
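To make that flow concrete, here is a toy end-to-end pipeline for a tiny addition-only language; every phase is drastically simplified (some are no-ops) and all names are invented for illustration, so treat it as a sketch of the architecture rather than a real compiler.

```python
# Toy pipeline for expressions like "1 + 2 + 3"; each function stands in for a whole phase.
def lexical_analysis(source):
    # Break the source into (type, text) tokens, ignoring whitespace.
    return [("NUMBER", ch) if ch.isdigit() else ("PLUS", ch)
            for ch in source if not ch.isspace()]

def syntax_analysis(tokens):
    # Parse NUMBER (PLUS NUMBER)* into nested ("add", left, right) nodes.
    tree = ("num", tokens[0][1])
    i = 1
    while i < len(tokens):
        assert tokens[i][0] == "PLUS" and tokens[i + 1][0] == "NUMBER"
        tree = ("add", tree, ("num", tokens[i + 1][1]))
        i += 2
    return tree

def semantic_analysis(tree):
    return tree  # nothing to check in this toy language

def optimize(tree):
    return tree  # no optimizations in this toy language

def code_generation(tree):
    # Emit instructions for an imaginary stack machine.
    if tree[0] == "num":
        return ["PUSH " + tree[1]]
    return code_generation(tree[1]) + code_generation(tree[2]) + ["ADD"]

tokens = lexical_analysis("1 + 2 + 3")
print(code_generation(optimize(semantic_analysis(syntax_analysis(tokens)))))
# ['PUSH 1', 'PUSH 2', 'ADD', 'PUSH 3', 'ADD']
```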
It is a beautiful day outside, so let's make a compiler. You don't need any knowledge of how compilers work to follow along. We are going to use Python to implement our own programming language, Teeny Tiny, that will compile to C code. It will take about 500 lines of code and provide the initial infrastructure needed to customize and extend the compiler into your own billion dollar production-ready compiler.
This tutorial is a series of posts that go step by step in building a working compiler. All of the source code can be found in the GitHub repo. If you follow along with all the posts, I guesstimate that it will take you only a few hours.
The Teeny Tiny language we are going to implement is a dialect of BASIC. The syntax is clean and simple. If you prefer a C-like syntax then it will be trivial to modify the compiler at the end. Here is an example program in our Teeny Tiny language:
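(The listing below is a reconstruction in the spirit of the dialect, using the BASIC-style keywords the compiler will support, such as PRINT, INPUT, LET, and WHILE; treat it as a representative sample rather than a canonical one.)

```
PRINT "How many fibonacci numbers do you want?"
INPUT nums

LET a = 0
LET b = 1
WHILE nums > 0 REPEAT
    PRINT a
    LET c = a + b
    LET a = b
    LET b = c
    LET nums = nums - 1
ENDWHILE
```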
Although this is a fairly standard subset of features, you may notice that there are no functions, no arrays, no way to read/write from a file, and not even an else statement. But with just this small set of constructs, you can actually do a lot. It also sets up the compiler in such a way that many other features will be straightforward to add later.
Our compiler will follow the three-step process illustrated above. First, given the input source code, it will break the code up into tokens. These are like words and punctuation in English. Second, it will parse the tokens to make sure they are in an order allowed in our language, just as English sentences follow specific structures of verbs and nouns. Third, it will emit the C code that our language translates to.
We will use these three steps as the main organization for our code. The lexer, parser, and emitter will each have their own Python code file. This tutorial is broken up into 3 parts based on these steps as well. If you were to extend the compiler, there are some additional steps you would add, but we will hold off on discussing those.
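To preview how the pieces will fit together, here is a hedged sketch of the compiler's entry point; the module, class, and method names (lex.Lexer, parse.Parser, emit.Emitter, parser.program, emitter.writeFile) and the out.c output file are working assumptions about how we will organize things, not final code.

```python
# Sketch of the compiler's entry point; names and file layout are assumptions for now.
import sys

from lex import Lexer      # tokenizes the source (part 1)
from parse import Parser   # checks the token order against our grammar (part 2)
from emit import Emitter   # writes out the equivalent C code (part 3)

def main():
    if len(sys.argv) != 2:
        sys.exit("Error: Compiler needs a source file as argument.")
    with open(sys.argv[1], "r") as inputFile:
        source = inputFile.read()

    lexer = Lexer(source)
    emitter = Emitter("out.c")
    parser = Parser(lexer, emitter)

    parser.program()     # run the parser, which drives the lexer and emitter
    emitter.writeFile()  # write the generated C code to disk
    print("Compiling completed.")

if __name__ == "__main__":
    main()
```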
The first module of our compiler is called the lexer. Given a string of Teeny Tiny code, it will iterate character by character to do two things: decide where each token starts/stops and what type of token it is. If the lexer is unable to do this, then it will report an error for an invalid token.
The figure demonstrates an example input and output of the lexer. Given the Teeny Tiny code, the lexer must determine where the tokens are along with the type (e.g., keyword). You can see that spaces aren't recognized as tokens, but the lexer will use them as one way to know when a token ends.
I like to sketch out all the functions that I think I will need, then go back and fill them in. The function getToken will be the meat of the lexer. It will be called each time the compiler is ready for the next token and it will do the work of classifying tokens. nextChar and peek are helper functions for looking at the next character. skipWhitespace consumes the spaces and tabs that we don't care about. abort is what we will use to report an invalid token.
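A skeleton along those lines might look like the following; the exact signatures are a working assumption at this stage, and the bodies will be filled in as we go.

```python
class Lexer:
    def __init__(self, source):
        pass

    # Process the next character.
    def nextChar(self):
        pass

    # Return the lookahead character without consuming it.
    def peek(self):
        pass

    # Invalid token found: print an error message and exit.
    def abort(self, message):
        pass

    # Skip the whitespace we don't care about, such as spaces and tabs.
    def skipWhitespace(self):
        pass

    # Classify and return the next token.
    def getToken(self):
        pass
```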
The lexer needs the input code and we append a newline to it (this just simplifies some checks later on). curChar is what the lexer will constantly check to decide what kind of token it is. Why not just do source[curPos]? Because that would litter the code with bounds checking. Instead we do this in nextChar:
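Here is one reasonable way to write the constructor and nextChar, reconstructed as a sketch; the details (the '\0' end-of-input marker in particular) are an assumption, but they match the bounds-checking rationale above.

```python
class Lexer:
    def __init__(self, source):
        # Append a newline to simplify lexing the last token/statement.
        self.source = source + '\n'
        self.curChar = ''   # Current character being examined.
        self.curPos = -1    # Current index into the source string.
        self.nextChar()     # Prime curChar with the first character.

    # Advance to the next character.
    def nextChar(self):
        self.curPos += 1
        if self.curPos >= len(self.source):
            self.curChar = '\0'  # Signal end of input instead of indexing out of bounds.
        else:
            self.curChar = self.source[self.curPos]
```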