add articles from kernel stash
michael-burke4 committed Jun 11, 2024
1 parent 0e958d3 commit f789294
Showing 13 changed files with 2,675 additions and 5 deletions.
87 changes: 87 additions & 0 deletions _articles/C_program_to_executable.md
@@ -0,0 +1,87 @@
---
layout: article
title: "Compilation Process Deep Dive: How a C Program Becomes an Executable"
---
Take this C source file `hello.c`:

```
#include <stdio.h>
int main(void)
{
    puts("Hello, world!");
    return 0;
}
```

You should be familiar with the following invocation:

    gcc hello.c -o hello

This is the most basic way to compile a C source file into a binary file you can execute.

Most of you are also familiar with breaking this process into two steps:

**Compilation**:

    gcc -c hello.c -o hello.o

**Linking**:

    gcc hello.o -o hello


This has advantages for large projects because the compilation can be done in parallel, and as you edit the code, only the files that you change need to be recompiled.

While *two* steps are enough for practical purposes (e.g. decreasing build time), they are not the full story.
In reality, the C compiler performs at least _four_ distinct processes behind the scenes: preprocessing, compilation, assembly, and linking.

The command `gcc -c hello.c -o hello.o` encompasses the preprocessing, compilation, and assembly steps, while the command `gcc hello.o -o hello` encompasses the linking step.

We can invoke each step manually like so:

0. Preprocessing

cpp hello.c -o hello.i

The
[C preprocessor](https://en.wikipedia.org/wiki/C_preprocessor)
expands `#include` directives, removes comments, collapses whitespace, and resolves macros.

The output is traditionally given the suffix `.i`, which stands for intermediate.

0. Compilation

cc -S hello.i -o hello.s

This is where C language constructs like variables, types, and control flow are flattened into undifferentiated data and code.

After this point, we have no way to tell with certainty that this assembly output came from C program input. A compiler for a different language could plausibly generate identical assembly output.

0. Assembly

as hello.s -o hello.o

The instructions are replaced with their machine code equivalents. This part is reversible, but
the assembler also rips out the last remnants of structure left over from the original C program source. Static data and functions lose their names and are referred to by only their address, and any exported or imported variables and functions become generic entries in a symbol table. All other labels (e.g. the target of a jump within the same function) are gone without a trace.

After this step, the output is no longer human readable text.

0. Linking

ld hello.o -l:crt1.o -lc -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o hello

Even though _our_ functions have been compiled into machine code that our CPU could in theory execute,
there is still work to be done. The linker collects the dependencies of our program
(the C startup runtime `-l:crt1.o`, which provides the `_start` symbol, and the C standard library `-lc`, which provides the `puts` symbol) and bundles them into one file.
The linker makes connections between object files by cross-referencing their symbol tables
to resolve previously unresolved symbols with their now-known locations.

In reality, symbol resolution is an instance of
the classic engineering trade-off between
execution speed and memory footprint.
Our C program, like most, is at least partially
[dynamically](https://en.wikipedia.org/wiki/Dynamic_linker)
linked at runtime (`-dynamic-linker /lib64/ld-linux-x86-64.so.2`).

The output is an executable ELF file that the kernel loader can load into memory and execute on a CPU.
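
To make the four stages more concrete, here is a minimal sketch of a source file in which each piece is handled by a different stage. The names `GREETING_COUNT`, `repeat_count`, and `greet_all` are hypothetical and exist only for illustration; the comments describe what typically happens with GCC on Linux rather than anything guaranteed by the C standard.

```c
#include <assert.h>
#include <stdio.h>

/* Preprocessing: every use of GREETING_COUNT is replaced by the text 3,
 * so the compiler proper never sees the macro name. */
#define GREETING_COUNT 3

/* Internal linkage: `static` keeps this function private to this file,
 * so other object files cannot link against it. */
static int repeat_count(void)
{
    return GREETING_COUNT;
}

/* External linkage: greet_all is a global symbol that other object
 * files can reference. The call to puts is left as an undefined symbol
 * in the object file until libc is linked in. */
int greet_all(void)
{
    int n = repeat_count();

    assert(n > 0); /* sanity check on the macro value */
    for (int i = 0; i < n; i++)
        puts("Hello, world!");
    return n;
}
```

Assuming the file is saved as `symbols.c` (a hypothetical name), running `gcc -c symbols.c` followed by `nm symbols.o` should list `greet_all` marked `T` (defined, global) and `puts` marked `U` (undefined, to be resolved at link time).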
217 changes: 217 additions & 0 deletions _articles/assembly_demo.md
@@ -0,0 +1,217 @@
---
layout: article
title: Assembly demo
---
### Assembly code written during a live lecture

```
.text
.globl _start
_start:
#write(1, QUESTION, sizeof(QUESTION) - 1);
mov $1, %rdi #stdout fileno
lea question, %rsi #pointer to string
mov $question_len, %rdx #length
mov $1, %rax #no. for write
syscall #do it!
cmp $0, %rax #check return value
jl error #if negative, error out
#read(0, buffer, sizeof(buffer));
mov $0, %rdi #stdin fileno
lea buffer, %rsi #pointer to buffer
mov $buffer_len, %rdx #length
mov $0, %rax #no. for read
syscall #do it!
push %rax #save return value
cmp $1, %rax #check return value
jle error #if <= 1, error out
#write(1, MESSAGE, sizeof(MESSAGE) - 1);
mov $1, %rdi #stdout fileno
lea hellomsg, %rsi #pointer to string
mov $hello_len, %rdx #length
mov $1, %rax #no. for write
syscall #do it!
cmp $0, %rax #check return value
jl error #if negative, error out
#write(1, buffer, (size_t)len);
mov $1, %rdi #stdout fileno
lea buffer, %rsi #pointer to buffer
pop %rdx #saved length
mov $1, %rax #no. for write
syscall #do it!
cmp $0, %rax #check return value
jl error #if negative, error out
mov $0, %rdi #exit status of 0
mov $60, %rax #no. for exit
syscall #do it!
error:
mov $2, %rdi #stderr fileno
lea errormsg, %rsi #pointer to string
mov $error_len, %rdx #length
mov $1, %rax #no. for write
syscall #do it!
mov $1, %rdi #exit status of 1
mov $60, %rax #no. for exit
syscall #do it!
.data
question:
.ascii "What is your name?\n"
.equ question_len, . - question
errormsg:
.ascii "error!\n"
.equ error_len, . - errormsg
buffer:
.equ buffer_len, 100
.space buffer_len, 0
hellomsg:
.ascii "Hello, "
.equ hello_len, . - hellomsg
```
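
For comparison, the comments in the assembly above already hint at the C calls each group of instructions corresponds to. The following is a rough C sketch of the same logic, not the code from the lecture. The function name `greet` and its file descriptor parameters are assumptions added so the logic can be exercised with pipes; passing `0` and `1` matches the stdin and stdout numbers used in the assembly.

```c
#include <assert.h>
#include <unistd.h>

/* Report failure on stderr and return the exit status the assembly uses. */
static int fail(void)
{
    static const char errormsg[] = "error!\n";

    if (write(STDERR_FILENO, errormsg, sizeof(errormsg) - 1) < 0) {
        /* nothing more we can do */
    }
    return 1;
}

/* Ask for a name on out_fd, read it from in_fd, and greet the user.
 * Returns 0 on success and 1 on error, mirroring the exit statuses. */
int greet(int in_fd, int out_fd)
{
    static const char question[] = "What is your name?\n";
    static const char hellomsg[] = "Hello, ";
    char buffer[100];
    ssize_t len;

    if (write(out_fd, question, sizeof(question) - 1) < 0)
        return fail();

    len = read(in_fd, buffer, sizeof(buffer));
    if (len <= 1) /* error, end of input, or nothing but a newline */
        return fail();
    assert((size_t)len <= sizeof(buffer));

    if (write(out_fd, hellomsg, sizeof(hellomsg) - 1) < 0)
        return fail();
    if (write(out_fd, buffer, (size_t)len) < 0)
        return fail();
    return 0;
}
```

Calling `greet(0, 1)` from `main` and returning its result would behave roughly like the assembly program, except that the C runtime, not our own `_start`, performs the final `exit` system call.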

### Similar prewritten example for both x86-64 and aarch64:

#### x86-64:

```
#include <syscall.h>
#define STDIN_FILENO 0
#define STDOUT_FILENO 1
.globl _start //make _start a global symbol so linker can find it
_start: //_start is entry point for all executables
mov $SYS_write, %rax //%rax holds syscall number, 1 represents `write`
mov $STDOUT_FILENO, %rdi //%rdi holds first syscall arg, 1 represents `stdout`
lea prompt, %rsi //%rsi holds second arg, the address of prompt string from data section
mov $prompt_len, %rdx //%rdx holds third arg, prompt_len is macro that expands to calculated size
syscall //perform a system call, the return value comes back in %rax
mov %rax, %rdi //copy return value so .out can use it as the exit code
cmp $0, %rax //check if return is negative
jl .out //if it is, exit program early with exit code based on return value
mov $SYS_read, %rax //0 represents `read`
mov $STDIN_FILENO, %rdi //0 represents `stdin`
lea buffer, %rsi //read into buffer
mov $buffer_len, %rdx //at most buffer_len bytes
syscall //perform syscall
mov %rax, %rdi //check for error as above
cmp $0, %rax
jl .out
mov %rax, %rbx //save returned length to only print that many bytes (syscall clobbers %rcx and %r11, so use %rbx)
mov $SYS_write, %rax //back to writing, send "Hello, " to stdout
mov $STDOUT_FILENO, %rdi
lea msg, %rsi
mov $msg_len, %rdx
syscall
mov %rax, %rdi //check for error
cmp $0, %rax
jl .out
mov $SYS_write, %rax //need to set %rax back to write because it was replaced with return value of last call
mov $STDOUT_FILENO, %rdi //likewise %rdi goes back to 1
lea buffer, %rsi //whatever they input
mov %rbx, %rdx //and however long it was
syscall //send that
mov %rax, %rdi //check for errors
cmp $0, %rax
jl .out
mov $0, %rdi //if there was not an error, set return code to 0
.out: //otherwise we were sent here and %rdi already contains error code to return
mov $SYS_exit, %rax //60 represents exit
syscall //exit program
//exit syscall does not return, so _start function does not need to return to caller
.data //data section for strings
prompt: .ascii "What is your name? "
.equ prompt_len, .-prompt //.equ makes a new macro, `.` represents current location in binary, and subtracting the value of prompt gives how many bytes prompt contained
buffer: .space 64
.equ buffer_len, .-buffer
msg: .ascii "Hello, "
.equ msg_len, .-msg
//a second, standalone example: it defines its own _start, so assemble it as a separate file
.data
message:
.ascii "Hello, World!\n"
len = . - message
.text
.global _start
_start:
mov $1, %rdi
mov $message, %rsi
mov $len, %rdx
mov $1, %rax
syscall
mov $13, %rdi
mov $60, %rax
syscall
```
#### aarch64:

```
#include <syscall.h>
#define STDIN_FILENO 0
#define STDOUT_FILENO 1
.globl _start //make _start a global symbol so linker can find it
_start: //_start is entry point for all executables
mov x8, #SYS_write //x8 holds syscall number, 64 represents `write`
mov x0, #STDOUT_FILENO //x0 holds first syscall arg, 1 represents `stdout`
ldr x1, =prompt //x1 holds second arg, =prompt gets address of prompt string from data section
mov x2, #prompt_len //x2 holds third arg, prompt_len is macro that expands to calculated size
svc #0 //perform a system call
cmp x0, #0 //check if return is negative
b.lt .out //if it is, exit program early with exit code based on return value
mov x8, #SYS_read //63 represents `read`
mov x0, #STDIN_FILENO //0 represents `stdin`
ldr x1, =buffer //read into buffer
mov x2, #buffer_len //at most buffer_len bytes
svc #0 //perform syscall
cmp x0, #0 //check for error as above
b.lt .out
mov x3, x0 //save returned length to only print that many bytes
mov x8, #SYS_write //back to writing, send "Hello, " to stdout
mov x0, #STDOUT_FILENO
ldr x1, =msg
mov x2, #msg_len
svc #0
cmp x0, #0 //check for error
b.lt .out
mov x0, #1 //need to set x0 back to 1 because it was replaced with return code of last call
ldr x1, =buffer //whatever they input
mov x2, x3 //and however long it was
svc #0 //send that
cmp x0, #0 //check for errors
b.lt .out
mov x0, #0 //if there was not an error, set return code to 0
.out: //otherwise we were sent here and x0 already contains error code to return
mov x8, #SYS_exit //93 represents exit
svc #0 //exit program
//exit syscall does not return, so _start function does not need to return to caller
.data //data section for strings
prompt: .ascii "What is your name? "
.equ prompt_len, .-prompt //.equ makes a new macro, `.` represents current location in binary, and subtracting the value of prompt gives how many bytes prompt contained
buffer: .space 64
.equ buffer_len, .-buffer
msg: .ascii "Hello, "
.equ msg_len, .-msg
```
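
The two listings differ mainly in register names and system call numbers. As a point of comparison, here is a small C sketch that makes the same system call by number through the C library's `syscall()` wrapper, using the same `SYS_write` constant. The helper name `raw_write` is hypothetical. One difference worth noting: the assembly sees a negative value in the return register on failure, while `syscall()` returns `-1` and sets `errno` instead.

```c
#define _GNU_SOURCE /* for syscall() in <unistd.h> */
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke the write system call by number, the way the assembly does.
 * SYS_write expands to 1 on x86-64 and 64 on aarch64, so this source
 * builds on both even though the numbers differ.
 * Returns the number of bytes written, or -1 with errno set. */
long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    return syscall(SYS_write, fd, buf, len);
}
```

For example, `raw_write(1, "hi\n", 3)` should print `hi` to standard output on either architecture without the source mentioning a specific number.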

### Example makefile for assembly with preprocessing

```
.PHONY: all clean
all: asm_hello
asm_hello: asm_hello.o
	ld -o asm_hello asm_hello.o
asm_hello.o: asm_hello.s
	as asm_hello.s -o asm_hello.o
asm_hello.s: asm_hello.S
	cpp asm_hello.S -o asm_hello.s
clean:
	-rm asm_hello.s asm_hello.o asm_hello
```

18 changes: 18 additions & 0 deletions _articles/everything_is_a_file.md
@@ -0,0 +1,18 @@
---
layout: article
title: Everything is a file (in Linux)
---
This elegant design principle dates back to
the beginning of (Unix) time, a.k.a.
[the 70s](https://en.wikipedia.org/wiki/January 1, 1970).
However, this simple principle is an
oversimplification - consider the existence
of directories.
In reality, the slogan
["Everything is a file"](https://en.wikipedia.org/wiki/Everything_is_a_file)
is a convenient shorthand for the more accurate
but less catchy notion that (almost) all
resources available to a process on a
[Unix-like](https://en.wikipedia.org/wiki/Unix-like)
operating system can be referenced by a
[file descriptor](https://en.wikipedia.org/wiki/File_descriptor).
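
As a small illustration of that notion, the sketch below writes the same bytes to two very different resources through the same `write` call. The function names `send_greeting` and `greet_pipe_and_device` are hypothetical, and the example assumes a Linux system where `/dev/null` exists.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a fixed message to whatever resource fd refers to. The function
 * neither knows nor cares whether that is a regular file, a pipe, a
 * terminal, or a device. Returns the number of bytes written or -1. */
ssize_t send_greeting(int fd)
{
    static const char msg[] = "hello\n";

    assert(fd >= 0);
    return write(fd, msg, sizeof(msg) - 1);
}

/* Exercise send_greeting on two kinds of resource: a pipe and the
 * /dev/null character device. Returns 0 on success, -1 on failure. */
int greet_pipe_and_device(void)
{
    int pipefd[2];
    int devnull;
    int ok;

    if (pipe(pipefd) != 0)
        return -1;
    devnull = open("/dev/null", O_WRONLY);
    if (devnull < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    ok = send_greeting(pipefd[1]) == 6 && send_greeting(devnull) == 6;
    close(devnull);
    close(pipefd[0]);
    close(pipefd[1]);
    return ok ? 0 : -1;
}
```

The point of the design is that `send_greeting` takes only an integer: everything that distinguishes a pipe from a device is hidden behind the file descriptor.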
37 changes: 37 additions & 0 deletions _articles/git_basics.md
@@ -0,0 +1,37 @@
---
layout: article
title: Introduction to Git
---
* We used [this](https://kdlp.underground.software/course/slides/git.html) slide deck

* Git is distributed [version control](https://en.wikipedia.org/wiki/Version_control) software

* [Git](https://git-scm.com/) is not [GitHub](https://github.com)

* GitHub is one implementation of an interface for git

* There are variously featured alternatives, such as [GitLab](https://gitlab.com/), [Bitbucket](https://bitbucket.org/), and [cgit](https://git.zx2c4.com/cgit/)

* The KDLP team maintain a custom-themed cgit instance [here](https://kdlp.underground.software/cgit)

* Git is built on a [tree-like data structure](https://en.wikipedia.org/wiki/Tree_(data_structure)) that contains the entire change history of a project

* **Git proficiency is one of the most useful and valuable software engineering skills a computer science student can learn in preparation to enter the industry**

* Charlie did a demo in the terminal. Here's a rough outline of the various git commands he covered:

* `git clone`: Cloning the [ILKD_assignments](https://kdlp.underground.software/cgit/ILKD_assignments/) repository

* `git commit`: Committing new local changes to the repository

* `git merge`: Combining two change histories into one

* `git reset`: Undoing previous changes, and going nuclear with `--hard`

* `git rebase`: Rewriting the git history

* (single commit rewrite cases can be handled with `git commit --amend`)

* When things don't go right, you may have to resolve merge conflicts by manually editing source files and re-committing

* This should not be something you have to do for this course. However, for anyone who is interested, here is an article on [merge conflicts](https://css-tricks.com/merge-conflicts-what-they-are-and-how-to-deal-with-them/)