Friday, June 5, 2015

Compiler

Compilation Steps
Very good tutorial
Very good Video tutorial

  • Pre-processing
    • Process the # Directives
  • Lexical Analysis
    • Strips out white spaces and comments
    • Divide the source code into lexemes, Outputs stream of tokens
    • Generates error on illegal lexemes: example:
      • 1_x = 1 + 2;    --> error: identifier the start with number
  • Syntax Analysis
    • Check the code against grammar, generate errors for wrong grammar.
    • Outputs a parse tree
    • Example of errors:
      • x = 1 + ;       --> can not generate parse tree
      • struct mystruct g,x;
        i = g * x;       --> i is not defined, g and x can not be multiplied, however no error is generated from syntax analyzer.
  • Semantic Analysis
    • Program symbol table is created, debug info is inserted
    • Checks the code for logical 
    • Error Examples:
      • struct mystruct g,x;
        i = g * x;    --> the multiplication operation is illegal for structs, i is not defined
      • warning: variable used before initialization
      • int i = "hi";  --> warning: implicit conversion from pointer to integer
  • Intermediate Code Generation
    • IR, Intermediate code Representation, is a machine code independent representation.
    • This helps to keep the parsing part of the compiler unchanged for different targets, while only the synthesis part is changed from target to target.
    • Output: AST (Abstract Syntax Tree) or Pseudo Code.
  • Machine Independent Code Optimization
    • Loop Unrolling
    • Expanding inline functions
    • Dead code removal
  • Code Generation
    • Conver IR code to machine opcodes
  • Machine Dependent Code Optimization
    • Register Allocation

More Details

  • Lexical Analysis (Tokenizing)
    • Input:
      • Is the c code
      • It consists of lexemes
    • Output
      • Tokens
        • a token is a <Token Class, "lexeme"> pair
    • Ex:
      • if (i == j)
            z = 0;
      • This is read by the parser as follows:
        • if (i == j)\n\tz = 0;
      • Step 1: Devide into lexemes:
        • |if| |(|i| |==| |j|)|\n\t|z| |=| |0|;|
      • Step 2: Identify the "Token Class" of each "lexeme"
        Hence, generate the tokens: <class,"lexeme"> pairs
        • <keyword,"if">
        • <whitespace," ">
        • <PAREN_OPEN,"(">
        • <identifier,"i">
        • <whitespace," ">
        • <operator,"==">
        • <whitespace," ">
        • <identifier,"j">
        • <PAREN_CLOSE,")">
        • <whitespace,"\n\t">
        • <identifier,"z">
        • <whitespace," ">
        • <EQ,"=">
        • <whitespace," ">
        • <integer,"0">
        • <SEMI_COLON,";">
      • Token Classes:
        • Whitespace
          • nonempty sequency of blanks, new lines or tabs
        • Integers
        • Keywords
        • Identifiers
        • Operators
        • Normally each "punctuation" lexemes, has its own class:
          • PAREN_O:           (
          • PAREN_C:           )
          • SEMI_COLON:    ;
          • EQ:                       =
  • Parsing (Syntax Analysis)
    • Input:
      • Sequence of tokens from the lexer
    • Output:
      • Parse Tree
    • Parser does the following:
      • Ensure that the the tokens follow C rules, to avoid syntax errors
      • Generates Parse tree:
      • Intermediate representation:
        • AST: Abstract Syntax Tree 
          • Also a pseudo code can replace AST
          • If the compiler supports different languages on different targets, then this AST or Pseudo code is machine independent
          • AST is like the Parse tree, but removing some unneeded info
    • Example:
      Parse Tree:

      The pic is from Online Courses Compiler Course

      Parse Tree Has some unneeded info:

      The pic is from Online Courses Compiler Course
  • AST (Abstract Syntax Tree):
    Generated as a sort of Intermediate Representation

    The pic is from Online Courses Compiler Course

  • Debug info: (mapping source code to machine code) maps functions, instructions in the mapped binary program, to the source code

        binary instruction => Item name, Item type, file name, line number, ...etc)
  • Symbol table: All identifiers in the source code is related to their memory segments and addresses

No comments:

Post a Comment