Compilation Steps
Very good tutorial
Very good Video tutorial
- Pre-processing
- Process the # Directives
- Lexical Analysis
- Strips out white spaces and comments
- Divide the source code into lexemes, Outputs stream of tokens
- Generates error on illegal lexemes: example:
- 1_x = 1 + 2; --> error: identifier the start with number
- Syntax Analysis
- Check the code against grammar, generate errors for wrong grammar.
- Outputs a parse tree
- Example of errors:
- x = 1 + ; --> can not generate parse tree
- struct mystruct g,x;
i = g * x; --> i is not defined, g and x can not be multiplied, however no error is generated from syntax analyzer. - Semantic Analysis
- Program symbol table is created, debug info is inserted
- Checks the code for logical
- Error Examples:
- struct mystruct g,x;
i = g * x; --> the multiplication operation is illegal for structs, i is not defined - warning: variable used before initialization
- int i = "hi"; --> warning: implicit conversion from pointer to integer
- Intermediate Code Generation
- IR, Intermediate code Representation, is a machine code independent representation.
- This helps to keep the parsing part of the compiler unchanged for different targets, while only the synthesis part is changed from target to target.
- Output: AST (Abstract Syntax Tree) or Pseudo Code.
- Machine Independent Code Optimization
- Loop Unrolling
- Expanding inline functions
- Dead code removal
- Code Generation
- Conver IR code to machine opcodes
- Machine Dependent Code Optimization
- Register Allocation
More Details
- Lexical Analysis (Tokenizing)
- Input:
- Is the c code
- It consists of lexemes
- Output
- Tokens
- a token is a <Token Class, "lexeme"> pair
- Ex:
- if (i == j)
z = 0; - This is read by the parser as follows:
- if (i == j)\n\tz = 0;
- Step 1: Devide into lexemes:
- |if| |(|i| |==| |j|)|\n\t|z| |=| |0|;|
- Step 2: Identify the "Token Class" of each "lexeme"
Hence, generate the tokens: <class,"lexeme"> pairs - <keyword,"if">
- <whitespace," ">
- <PAREN_OPEN,"(">
- <identifier,"i">
- <whitespace," ">
- <operator,"==">
- <whitespace," ">
- <identifier,"j">
- <PAREN_CLOSE,")">
- <whitespace,"\n\t">
- <identifier,"z">
- <whitespace," ">
- <EQ,"=">
- <whitespace," ">
- <integer,"0">
- <SEMI_COLON,";">
- Token Classes:
- Whitespace
- nonempty sequency of blanks, new lines or tabs
- Integers
- Keywords
- Identifiers
- Operators
- Normally each "punctuation" lexemes, has its own class:
- PAREN_O: (
- PAREN_C: )
- SEMI_COLON: ;
- EQ: =
- Parsing (Syntax Analysis)
- Input:
- Sequence of tokens from the lexer
- Output:
- Parse Tree
- Parser does the following:
- Ensure that the the tokens follow C rules, to avoid syntax errors
- Generates Parse tree:
- Intermediate representation:
- AST: Abstract Syntax Tree
- Also a pseudo code can replace AST
- If the compiler supports different languages on different targets, then this AST or Pseudo code is machine independent
- AST is like the Parse tree, but removing some unneeded info
- Example:
Parse Tree:
The pic is from Online Courses Compiler Course
Parse Tree Has some unneeded info:
The pic is from Online Courses Compiler Course
- AST (Abstract Syntax Tree):Generated as a sort of Intermediate Representation
The pic is from Online Courses Compiler Course
- Debug info: (mapping source code to machine code) maps functions, instructions in the mapped binary program, to the source code
binary instruction => Item name, Item type, file name, line number, ...etc) - Symbol table: All identifiers in the source code is related to their memory segments and addresses
No comments:
Post a Comment