Walls ...: Compiler

Friday, June 5, 2015

Compiler

Compilation Steps
Very good tutorial
Very good Video tutorial

Pre-processing

Process the # Directives

Lexical Analysis

Strips out white spaces and comments
Divide the source code into lexemes, Outputs stream of tokens
Generates error on illegal lexemes: example:

1_x = 1 + 2; --> error: identifier the start with number

Syntax Analysis

Check the code against grammar, generate errors for wrong grammar.
Outputs a parse tree
Example of errors:

x = 1 + ; --> can not generate parse tree
struct mystruct g,x;
i = g * x; --> i is not defined, g and x can not be multiplied, however no error is generated from syntax analyzer.

Semantic Analysis

Program symbol table is created, debug info is inserted
Checks the code for logical
Error Examples:

struct mystruct g,x;
i = g * x; --> the multiplication operation is illegal for structs, i is not defined
warning: variable used before initialization
int i = "hi"; --> warning: implicit conversion from pointer to integer

Intermediate Code Generation

IR, Intermediate code Representation, is a machine code independent representation.
This helps to keep the parsing part of the compiler unchanged for different targets, while only the synthesis part is changed from target to target.
Output: AST (Abstract Syntax Tree) or Pseudo Code.

Machine Independent Code Optimization

Loop Unrolling
Expanding inline functions
Dead code removal

Code Generation

Conver IR code to machine opcodes

Machine Dependent Code Optimization

Register Allocation

More Details

Lexical Analysis (Tokenizing)

Input:

Is the c code
It consists of lexemes

Output

Tokens

a token is a <Token Class, "lexeme"> pair

Ex:

if (i == j)
z = 0;
This is read by the parser as follows:

if (i == j)\n\tz = 0;

Step 1: Devide into lexemes:

|if| |(|i| |==| |j|)|\n\t|z| |=| |0|;|

Step 2: Identify the "Token Class" of each "lexeme"
Hence, generate the tokens: <class,"lexeme"> pairs

<keyword,"if">
<whitespace," ">
<PAREN_OPEN,"(">
<identifier,"i">
<whitespace," ">
<operator,"==">
<whitespace," ">
<identifier,"j">
<PAREN_CLOSE,")">
<whitespace,"\n\t">
<identifier,"z">
<whitespace," ">
<EQ,"=">
<whitespace," ">
<integer,"0">
<SEMI_COLON,";">

Token Classes:

Whitespace

nonempty sequency of blanks, new lines or tabs

Integers
Keywords
Identifiers
Operators
Normally each "punctuation" lexemes, has its own class:

PAREN_O: (
PAREN_C: )
SEMI_COLON: ;
EQ: =

Parsing (Syntax Analysis)

Input:

Sequence of tokens from the lexer

Output:

Parse Tree

Parser does the following:

Ensure that the the tokens follow C rules, to avoid syntax errors
Generates Parse tree:

Intermediate representation:

AST: Abstract Syntax Tree

Also a pseudo code can replace AST
If the compiler supports different languages on different targets, then this AST or Pseudo code is machine independent
AST is like the Parse tree, but removing some unneeded info

Example:
Parse Tree:

The pic is from Online Courses Compiler Course

Parse Tree Has some unneeded info:

The pic is from Online Courses Compiler Course

AST (Abstract Syntax Tree):
Generated as a sort of Intermediate Representation

The pic is from Online Courses Compiler Course
Debug info: (mapping source code to machine code) maps functions, instructions in the mapped binary program, to the source code

binary instruction => Item name, Item type, file name, line number, ...etc)
Symbol table: All identifiers in the source code is related to their memory segments and addresses

No comments:

Post a Comment

Subscribe to: Post Comments (Atom)