This repository contains the implementation of assignments from the Compiler Sessional course, offered in the Department of Computer Science and Engineering at Bangladesh University of Engineering and Technology (BUET). The course focuses on developing a C compiler that performs lexical, syntax, and semantic analysis, and generates intermediate code.
- Introduction
- Getting Started
- Assignments
- Making it Easier: C Compiler
- Backpatching
- Disclaimer
- Contributing
- License
This repository contains the implementation of various assignments for the CSE-310 Compiler Sessional course. The primary objective is to develop a C compiler that performs lexical, syntax, and semantic analysis and generates intermediate code in 8086 assembly language.
To get started with this repository:
1. Clone the repository:

   ```bash
   git clone https://github.com/Anonto050/CSE-310-Compiler.git
   ```

2. Navigate to the project directory:

   ```bash
   cd CSE-310-Compiler
   ```

3. Build the project:

   ```bash
   make
   ```

4. Run the compiler on a C file:

   ```bash
   ./compiler program_name.c
   ```
The assignments cover four components:

- Symbol Table
- Lexical Analysis
- Syntax & Semantic Analysis
- Intermediate Code Generation
This project walks through the development of a C compiler that performs lexical, syntax, and semantic analysis and generates intermediate code. The compiler covers a subset of the C language and uses Flex (lexer) and Bison (YACC-compatible parser generator) for analysis.
A Symbol Table is a data structure maintained by compilers to store information about the occurrence of various entities such as identifiers, objects, function names, etc. It helps in organizing the program's symbols and facilitates efficient lookup, insertion, and deletion operations.
The Symbol Table is implemented using Scope Tables. Each Scope Table contains information about symbols within a specific scope. Here are the key components:
- SymbolInfo: Stores information about a symbol, such as its type, value, and scope.
- ScopeTable: Maintains a hash table for symbols within a scope, allowing for quick lookup and management of symbols.
- SymbolTable: Manages multiple scopes, allowing for entering and exiting scopes, which is essential for handling nested scopes in a program.
```cpp
#include <string>
#include <unordered_map>
using namespace std;

class SymbolInfo {
public:
    string name;
    string type;
    // Additional attributes can be added as needed

    SymbolInfo(string name, string type) : name(name), type(type) {}
};

class ScopeTable {
private:
    unordered_map<string, SymbolInfo*> table;
    int id;
    ScopeTable* parentScope;

public:
    ScopeTable(int id, ScopeTable* parent) : id(id), parentScope(parent) {}

    ScopeTable* getParentScope() { return parentScope; }

    // Insert a symbol into this scope; fails if the name already exists here
    bool insert(SymbolInfo* symbol) {
        if (table.find(symbol->name) == table.end()) {
            table[symbol->name] = symbol;
            return true;
        }
        return false;
    }

    // Look up a name in this scope, then in the enclosing scopes
    SymbolInfo* lookup(string name) {
        if (table.find(name) != table.end()) {
            return table[name];
        }
        if (parentScope != nullptr) {
            return parentScope->lookup(name);
        }
        return nullptr;
    }

    // Other methods for deletion, printing, etc.
};

class SymbolTable {
private:
    ScopeTable* currentScope;
    int currentScopeId;

public:
    SymbolTable() : currentScopeId(1) {
        currentScope = new ScopeTable(currentScopeId, nullptr);
    }

    // Push a new scope (e.g., on entering a compound statement)
    void enterScope() {
        currentScopeId++;
        currentScope = new ScopeTable(currentScopeId, currentScope);
    }

    // Pop the current scope, but never discard the global scope
    void exitScope() {
        if (currentScope->getParentScope() != nullptr) {
            ScopeTable* temp = currentScope;
            currentScope = currentScope->getParentScope();
            delete temp;
        }
    }

    bool insert(SymbolInfo* symbol) {
        return currentScope->insert(symbol);
    }

    SymbolInfo* lookup(string name) {
        return currentScope->lookup(name);
    }

    // Other methods for managing scopes and symbols
};
```
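A minimal usage sketch of this interface (illustrative only; the actual driver code in the assignments differs):

```cpp
int main() {
    SymbolTable st;                               // starts with the global scope
    st.insert(new SymbolInfo("x", "VARIABLE"));   // declare x globally

    st.enterScope();                              // e.g., entering a function body
    st.insert(new SymbolInfo("y", "VARIABLE"));   // declare y locally

    SymbolInfo* s = st.lookup("x");               // found via the parent-scope chain
    st.exitScope();                               // y is discarded with its scope
    return s != nullptr ? 0 : 1;
}
```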
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. This step is performed by the lexical analyzer (lexer), which identifies meaningful lexemes and classifies them into token types. In this project, we use `flex` to define the lexer.
A flex file consists of three sections:
- Definitions: Define macros and include headers.
- Rules: Specify regular expressions and the corresponding actions.
- User Code: Additional C/C++ code used by the lexer.
```lex
%{
#include "symbol_table.h"  /* SymbolInfo, used in the actions below */
#include "y.tab.h"
%}

WHITESPACE [ \t\f\r\v]+
LETTER [a-zA-Z_]
DIGIT [0-9]
NEWLINE [\r]?\n
ALNUM [A-Za-z_0-9]
ALL_EXCEPT_BACKSLASH [^\\]

%%

{LETTER}{ALNUM}* {
    yylval.symbol_info = new SymbolInfo(yytext, "ID");
    return ID;
}

"+"|"-" {
    yylval.symbol_info = new SymbolInfo(yytext, "ADDOP");
    return ADDOP;
}

. {
    /* Handle other tokens */
}

%%
```
In the example above, the lexer identifies identifiers and addition operators and returns the corresponding tokens to the parser.
Additional Notes:

- Whitespace and comments are identified but not passed to the parser.
- The lexer counts line numbers by matching newline characters.
- Tokens that carry attributes are passed to the parser as `SymbolInfo` objects containing the matched lexeme and the token type.
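For example, line counting can be implemented with a dedicated rule (a sketch; `line_count` is an assumed global, not necessarily the name used in the repository):

```lex
%{
int line_count = 1;   /* assumed global line counter */
%}

NEWLINE [\r]?\n

%%

{NEWLINE} { line_count++; }
```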
The lexer detects and reports errors with corresponding line numbers. Types of errors include:
- Excessive decimal points, e.g., `1.2.345`
- Malformed numbers, e.g., `1E10.7`
- Invalid suffixes on numeric constants or invalid prefixes on identifiers, e.g., `12abcd`
- Multi-character constants, e.g., `'ab'`
- Unfinished character constants, e.g., `'a`, `'\n`, or `'\'`
- Empty character constants, e.g., `''`
- Unfinished strings, e.g., `"this is an unfinished string`
- Unfinished comments, e.g., `/* This is an unfinished comment`
- Unrecognized characters
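As an illustration, a rule that reports the excessive-decimal-point error might look like this (a sketch; it assumes the `DIGIT` definition above, and `line_count` and `error_count` are illustrative global names):

```lex
{DIGIT}*\.{DIGIT}*(\.{DIGIT}*)+ {
    fprintf(stderr, "Error at line %d: too many decimal points %s\n",
            line_count, yytext);
    error_count++;
}
```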
The parser takes the sequence of tokens produced by the lexer and verifies that the sequence conforms to the grammar of the source language. This step is known as syntax analysis. Semantic analysis ensures that the program's constructs are semantically correct.
A Bison file consists of three sections:
- Definitions: Define tokens, data types, and precedence rules.
- Rules: Specify the grammar rules and corresponding actions.
- User Code: Additional C/C++ code used by the parser.
```yacc
%{
#include <stdio.h>
#include "symbol_table.h"

int yylex(void);
void yyerror(const char* s);
%}

%union {
    SymbolInfo* symbol_info;
    // Other types
}

%token INT FLOAT VOID
%token <symbol_info> ID ADDOP
%type <symbol_info> expression term factor

%%

start : program
      ;

program : program unit
        | unit
        ;

unit : var_declaration
     | func_declaration
     | func_definition
     ;

var_declaration : type_specifier declaration_list ';'
                ;

type_specifier : INT
               | FLOAT
               | VOID
               ;

declaration_list : declaration_list ',' ID
                 | ID
                 ;

/* rules for func_declaration, func_definition, expression, term,
   and factor are elided for brevity */

%%

int main() {
    yyparse();
    return 0;
}

void yyerror(const char* s) {
    fprintf(stderr, "Error: %s\n", s);
}
```
In the example above, the grammar defines rules for a simple C-like language, including variable declarations and type specifiers.
The source language subset has the following characteristics (a small illustrative program follows the list):

- Multiple functions with unique names, declared or defined before being called.
- No preprocessing directives such as `#include` or `#define`.
- Variables can be declared inside functions or globally.
- Standard precedence and associativity rules apply, excluding consecutive logical or relational operators.
- No `break` or `switch-case` statements.
- `println(int n)` is used instead of `printf("%d\n", n)`.
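A small program in the accepted subset might look like this (illustrative only, not a test case from the repository):

```c
int sum(int a, int b) {
    return a + b;
}

int main() {
    int x;
    x = sum(2, 3) * 4;
    println(x);        /* prints 20; used in place of printf */
    return 0;
}
```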
Type Checking
- Verifies assignment operand compatibility.
- Ensures array indices are integers.
- Requires integer operands for the modulus operator.
- Validates function call arguments against definitions.
- Prohibits void functions in expressions.
Type Conversion
An error is reported on float-to-integer conversions in expressions. Logical and relational operations yield integer results.
Uniqueness Checking
Ensures variables are declared and unique within their scope.
Array Index
Checks for correct usage of array indices.
Function Parameters
Validates the number and types of function call arguments, ensuring consistency with declarations.
- Basic Types: `CONST_INT`, `CONST_FLOAT`, `CONST_CHAR`, `CONST_INT*`, `CONST_FLOAT*`, `CONST_CHAR*`
- Special Types: `UNDEC` (undeclared IDs), `ERROR` (expressions with errors), `FUNC_VOID` (void function return types), `VARIABLE` (other types); a sketch of how these names drive a check follows below
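For illustration, a semantic action enforcing the integer-operand rule for the modulus operator might look like this (a sketch only: the rule and nonterminal names are assumptions, not the exact grammar of the assignments):

```yacc
term : term '%' unary_expression {
          /* hypothetical check: modulus requires integer operands */
          if ($1->type != "CONST_INT" || $3->type != "CONST_INT") {
              yyerror("non-integer operand for modulus operator");
              $$ = new SymbolInfo($1->name + "%" + $3->name, "ERROR");
          } else {
              $$ = new SymbolInfo($1->name + "%" + $3->name, "CONST_INT");
          }
      }
     ;
```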
Error recovery techniques allow the parser to continue processing after encountering an error, providing meaningful error messages and preventing abrupt termination. Bison provides mechanisms for handling syntax errors gracefully.
```yacc
%{
#include <stdio.h>
#include "symbol_table.h"
void yyerror(const char* s);
%}

%union {
    SymbolInfo* symbol_info;
}

%token IF
%token <symbol_info> ID ADDOP
%type <symbol_info> expression term factor

%nonassoc LOWER_THAN_ELSE
%nonassoc ELSE

%%

statement : IF '(' expression ')' statement %prec LOWER_THAN_ELSE
          | IF '(' expression ')' statement ELSE statement
          ;

declaration_list : declaration_list ',' ID
                 | declaration_list error ',' ID
                   {
                       yyclearin;  /* discard the offending lookahead token */
                       yyerrok;    /* leave the error state and resume parsing */
                   }
                 | ID
                 ;

/* rules for expression, term, factor elided */

%%

void yyerror(const char* s) {
    fprintf(stderr, "Error: %s\n", s);
}
```
In the example above, the parser handles syntax errors in `declaration_list` by using the `error` token and clearing the lookahead token so that parsing can continue.
Intermediate code generation translates the source code into an intermediate representation, which is then translated to machine code. In this project, we use 8086 assembly language as the intermediate representation.
Intermediate code is generated on the fly by writing to `code.asm` as soon as a rule is matched. Expression evaluation relies on `PUSH` and `POP` instructions for stack operations.
Consider the grammar:
```
E -> E + T
E -> T
T -> T * F
T -> F
F -> id
F -> digit
```
For the expression `a + 2 * c`, the lexer generates the tokens `id + digit * id`. The parser then matches rules and generates code as follows (the final instructions store the result into a variable `d`, as in the assignment `d = a + 2 * c`):
```asm
PUSH a          ; F -> id
PUSH 2          ; F -> digit
PUSH c          ; F -> id
POP BX          ; T -> T * F: BX = c
POP AX          ;             AX = 2
MUL BX          ;             AX = 2 * c
PUSH AX         ;             push the product
POP BX          ; E -> E + T: BX = 2 * c
POP AX          ;             AX = a
ADD AX, BX      ;             AX = a + 2 * c
PUSH AX         ;             push the sum
POP BX
POP AX
MOV d, BX       ; store the result in d
PUSH d
POP BX
```
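A semantic action that emits such code on the fly might look like the following (a sketch; `asm_file` is an assumed `FILE*` opened on `code.asm`, not necessarily the name used in the repository):

```yacc
E : E '+' T {
        /* both operand values are already on the stack */
        fprintf(asm_file, "POP BX\n");      /* right operand (T) */
        fprintf(asm_file, "POP AX\n");      /* left operand (E)  */
        fprintf(asm_file, "ADD AX, BX\n");  /* AX = E + T        */
        fprintf(asm_file, "PUSH AX\n");     /* push the result   */
    }
  | T
  ;
```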
Special handling is required for evaluating `for` loops: the line number of the third (increment) expression is saved so that the corresponding code can be inserted after the loop body.
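One way to realize this (a sketch under assumptions; the repository records line numbers in `code.asm`, while this sketch buffers the code instead, and the names `asm_out`, `incr_code`, and `emit_for_tail` are hypothetical):

```cpp
#include <fstream>
#include <string>

// Hypothetical sketch: buffer the for-loop increment code so it can be
// emitted after the loop body rather than where it is parsed.
std::ofstream asm_out("code.asm", std::ios::app);
std::string incr_code;   // filled while the third expression is parsed

void emit_for_tail(const std::string& cond_label) {
    asm_out << incr_code;                     // increment-expression code
    asm_out << "JMP " << cond_label << "\n";  // jump back to the condition check
}
```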
Refer to pages 303-305 (14.5.3 Using the stack for procedures) of "Assembly Language Programming and Organization of the IBM PC" by Ytha Yu, Charles Marut.
Variables are declared by pushing dummy values to the stack or adding entries to the data segment.
Variables are accessed by checking their scope and using appropriate addressing modes.
```asm
a           ; global variable
b[0]        ; global array
[BP + -2]   ; local variable
[BP + -4]   ; local array, index 0
```
- Remove Redundant Instructions: Eliminate unnecessary `PUSH` and `POP` instructions (illustrated below).
- Optimize `MOV` Instructions: Remove redundant moves.
- Simplify Jumps and Comparisons: Remove unnecessary jump and compare instructions.
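For instance, a `PUSH` immediately followed by a `POP` into another register is equivalent to a single `MOV` (an illustrative example, not output taken from the repository):

```asm
; before peephole optimization
PUSH AX
POP BX          ; the value travels from AX to BX via the stack

; after peephole optimization
MOV BX, AX      ; same effect with one instruction and no stack traffic
```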
If you want to apply backpatching in your ICG code, you need to understand how it works. For a detailed explanation, see here.
This repository serves as an archive of the author's solutions to the course assignments. The solutions are not guaranteed to be foolproof, and the author is not responsible for any damage caused by their use. While the repository can serve as a reference, copying the solutions amounts to academic dishonesty and is strongly discouraged; the author is not responsible for any consequences of such actions.
Contributions are welcome! If you have any suggestions or improvements, feel free to create a pull request. Please ensure that your contributions adhere to the project guidelines.
This repository is licensed under the MIT License. See the LICENSE file for more information.