diff --git a/_articles/C_program_to_executable.md b/_articles/C_program_to_executable.md new file mode 100644 index 0000000..62fd445 --- /dev/null +++ b/_articles/C_program_to_executable.md @@ -0,0 +1,87 @@ +--- +layout: article +title: "Compilation Process Deep Dive: How a C Program Becomes an Executable" +--- +Take this C source file `hello.c`: + +``` +#include <stdio.h> + +int main(void) +{ + puts("Hello, world!"); + return 0; +} +``` + +You should be familiar with the following invocation: + + gcc hello.c -o hello + +This is the most basic way to compile a C source file into a binary file you can execute. + +Most of you are also familiar with breaking this process into two steps: + +**Compilation**: + + gcc -c hello.c -o hello.o + +**Linking**: + + gcc hello.o -o hello + + +This has advantages for large projects because the compilation can be done in parallel, and as you edit the code, only the files that you change need to be recompiled. + +While *two* steps are enough for practical purposes (e.g. decreasing build time), they are not the full story. +In reality, the C compiler performs at least _four_ distinct processes behind the scenes: preprocessing, compilation, assembly, and linking. + +The command `gcc -c hello.c -o hello.o` encompasses the preprocessing, compilation, and assembly steps, while the command `gcc hello.o -o hello` encompasses the linking step. + +We can invoke each step manually like so: + +0. Preprocessing + + cpp hello.c -o hello.i + +The +[C preprocessor](https://en.wikipedia.org/wiki/C_preprocessor) +removes comments, collapses whitespace, and expands macros and `#include` directives. + +The output is traditionally given the suffix `.i`, which stands for intermediate. + +0. Compilation + + cc -S hello.i -o hello.s + +This is where C language constructs like variables, types, and control flow are flattened into undifferentiated data and code. + +After this point, we have no way to tell with certainty that this assembly output came from C program input. 
A compiler for a different language could plausibly generate identical assembly output. + +0. Assembly + + as hello.s -o hello.o + +The instructions are replaced with their machine code equivalents. This part is reversible, but +the assembler also rips out the last remnants of structure left over from the original C program source. Static data and functions lose their names and are referred to only by their address, and any exported or imported variables and functions become generic entries in a symbol table. All other labels (e.g. the target of a jump within the same function) are gone without a trace. + +After this step, the output is no longer human-readable text. + +0. Linking + + ld hello.o -l:crt1.o -lc -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o hello + +Even though _our_ functions have been compiled into machine code that our CPU could in theory execute, +there is still work to be done. The linker collects the dependencies of our program +(the C startup runtime `-l:crt1.o` that provides the `_start` symbol and the C standard library `-lc` which provides the `puts` symbol) and bundles them into one file. +The linker makes connections between object files by cross-referencing their symbol tables +to resolve previously unresolved symbols with their now-known locations. + +In reality, symbol resolution is an instance of +the classic engineering trade-off between +execution speed and memory footprint. +Our C program, like most, is at least partially +[dynamically](https://en.wikipedia.org/wiki/Dynamic_linker) +linked at runtime (`-dynamic-linker /lib64/ld-linux-x86-64.so.2`). + +The output is an executable ELF file that the kernel loader can load into memory and execute on a CPU. 
diff --git a/_articles/assembly_demo.md b/_articles/assembly_demo.md new file mode 100644 index 0000000..b588500 --- /dev/null +++ b/_articles/assembly_demo.md @@ -0,0 +1,217 @@ +--- +layout: article +title: Assembly demo +--- +### Assembly code written during a live lecture + +``` +.text +.globl _start +_start: + #write(1, QUESTION, sizeof(QUESTION) - 1); + mov $1, %rdi #stdout fileno + lea question, %rsi #pointer to string + mov $question_len, %rdx #length + mov $1, %rax #no. for write + syscall #do it! + cmp $0, %rax #check return value + jl error #if negative, error out + + #read(0, buffer, sizeof(buffer)); + mov $0, %rdi #stdin fileno + lea buffer, %rsi #pointer to buffer + mov $buffer_len, %rdx #length + mov $0, %rax #no. for read + syscall #do it! + push %rax #save return value + cmp $1, %rax #check return value + jle error #if <= 1, error out + + #write(1, MESSAGE, sizeof(MESSAGE) - 1); + mov $1, %rdi #stdout fileno + lea hellomsg, %rsi #pointer to string + mov $hello_len, %rdx #length + mov $1, %rax #no. for write + syscall #do it! + cmp $0, %rax #check return value + jl error #if negative, error out + + #write(1, buffer, (size_t)len); + mov $1, %rdi #stdout fileno + lea buffer, %rsi #pointer to buffer + pop %rdx #saved length + mov $1, %rax #no. for write + syscall #do it! + cmp $0, %rax #check return value + jl error #if negative, error out + + mov $0, %rdi #exit status of 0 + mov $60, %rax #no. for exit + syscall #do it! +error: + mov $2, %rdi #stderr fileno + lea errormsg, %rsi #pointer to string + mov $error_len, %rdx #length + mov $1, %rax #no. for write + syscall #do it! + mov $1, %rdi #exit status of 1 + mov $60, %rax #no. for exit + syscall #do it! + +.data +question: +.ascii "What is your name?\n" +.equ question_len, . - question +errormsg: +.ascii "error!\n" +.equ error_len, . - errormsg +buffer: +.equ buffer_len, 100 +.space buffer_len, 0 +hellomsg: +.ascii "Hello, " +.equ hello_len, . 
- hellomsg +``` + +### Similar prewritten example for both x86-64 and aarch64: + +#### x86-64: + +``` +#include <sys/syscall.h> + +#define STDIN_FILENO 0 +#define STDOUT_FILENO 1 + +.globl _start //make _start a global symbol so linker can find it +_start: //_start is entry point for all executables + mov $SYS_write, %rax //%rax holds syscall number, 1 represents `write` + mov $STDOUT_FILENO, %rdi //%rdi holds first syscall arg, 1 represents `stdout` + lea prompt, %rsi //%rsi holds second arg, the address of the prompt string from data section + mov $prompt_len, %rdx //%rdx holds third arg, prompt_len is macro that expands to calculated size + syscall //perform a system call + cmp $0, %rax //check if return value (in %rax) is negative + jl .out //if it is, exit program early with exit code based on return value + mov $SYS_read, %rax //0 represents `read` + mov $STDIN_FILENO, %rdi //0 represents `stdin` + lea buffer, %rsi //read into buffer + mov $buffer_len, %rdx //at most buffer_len bytes + syscall //perform syscall + cmp $0, %rax //check for error as above + jl .out + mov %rax, %r12 //save returned length to only print that many bytes (syscall clobbers %rcx and %r11, so use %r12) + mov $SYS_write, %rax //back to writing, send "Hello, " to stdout + mov $STDOUT_FILENO, %rdi + lea msg, %rsi + mov $msg_len, %rdx + syscall + cmp $0, %rax //check for error + jl .out + mov $SYS_write, %rax //need to set %rax back to `write` because it was replaced with return value of last call + mov $STDOUT_FILENO, %rdi + lea buffer, %rsi //whatever they input + mov %r12, %rdx //and however long it was + syscall //send that + cmp $0, %rax //check for errors + jl .out + mov $0, %rax //if there was not an error, set return code to 0 +.out: //otherwise we were sent here and %rax already contains error code to return + mov %rax, %rdi //exit status goes in first syscall arg + mov $SYS_exit, %rax //60 represents exit + syscall //exit program + //exit syscall does not return, so _start function does not need to return to caller + +.data //data section for strings +prompt: .ascii "What is your name? 
" +.equ prompt_len, .-prompt //.equ makes a new macro, `.` represents current location in binary, and subtracting the value of prompt gives how many bytes prompt contained +buffer: .space 64 +.equ buffer_len, .-buffer +msg: .ascii "Hello, " +.equ msg_len, .-msg + +# a second, minimal example: assemble it as a separate file, since it defines its own _start +.data +message: + .ascii "Hello, World!\n" + len = . - message +.text +.global _start +_start: + mov $1, %rdi + mov $message, %rsi + mov $len, %rdx + mov $1, %rax + syscall + mov $13, %rdi + mov $60, %rax + syscall +``` +#### aarch64: + +``` +#include <sys/syscall.h> + +#define STDIN_FILENO 0 +#define STDOUT_FILENO 1 + +.globl _start //make _start a global symbol so linker can find it +_start: //_start is entry point for all executables + mov x8, #SYS_write //x8 holds syscall number, 64 represents `write` + mov x0, #STDOUT_FILENO //x0 holds first syscall arg, 1 represents `stdout` + ldr x1, =prompt //x1 holds second arg, =prompt gets address of prompt string from data section + mov x2, #prompt_len //x2 holds third arg, prompt_len is macro that expands to calculated size + svc #0 //perform a system call + cmp x0, #0 //check if return is negative + b.lt .out //if it is, exit program early with exit code based on return value + mov x8, #SYS_read //63 represents `read` + mov x0, #STDIN_FILENO //0 represents `stdin` + ldr x1, =buffer //read into buffer + mov x2, #buffer_len //at most buffer_len bytes + svc #0 //perform syscall + cmp x0, #0 //check for error as above + b.lt .out + mov x3, x0 //save returned length to only print that many bytes + mov x8, #SYS_write //back to writing, send "Hello, " to stdout + mov x0, #STDOUT_FILENO + ldr x1, =msg + mov x2, #msg_len + svc #0 + cmp x0, #0 //check for error + b.lt .out + mov x0, #1 //need to set x0 back to 1 because it was replaced with return code of last call + ldr x1, =buffer //whatever they input + mov x2, x3 //and however long it was + svc #0 //send that + cmp x0, #0 //check for errors + b.lt .out + mov x0, #0 //if there was not an error, set return code to 0 +.out: 
//otherwise we were sent here and x0 already contains error code to return + mov x8, #SYS_exit //93 represents exit + svc #0 //exit program + //exit syscall does not return, so _start function does not need to return to caller + +.data //data section for strings +prompt: .ascii "What is your name? " +.equ prompt_len, .-prompt //.equ makes a new macro, `.` represents current location in binary, and subtracting the value of prompt gives how many bytes prompt contained +buffer: .space 64 +.equ buffer_len, .-buffer +msg: .ascii "Hello, " +.equ msg_len, .-msg +``` + +### Example makefile for assembly with preprocessing + +``` +.PHONY: all clean +all: asm_hello + +asm_hello: asm_hello.o + ld -o asm_hello asm_hello.o + +asm_hello.o: asm_hello.s + as asm_hello.s -o asm_hello.o + +asm_hello.s: asm_hello.S + cpp asm_hello.S -o asm_hello.s + +clean: + -rm asm_hello.s asm_hello.o asm_hello +``` + diff --git a/_articles/everything_is_a_file.md b/_articles/everything_is_a_file.md new file mode 100644 index 0000000..75e0b43 --- /dev/null +++ b/_articles/everything_is_a_file.md @@ -0,0 +1,18 @@ +--- +layout: article +title: Everything is a file (in Linux) +--- +This elegant design principle dates back to +the beginning of (Unix) time, a.k.a. +[the 70s](https://en.wikipedia.org/wiki/January 1, 1970). +However, this simple principle is an +oversimplification: consider the existence +of directories. +In reality, the slogan +["Everything is a file"](https://en.wikipedia.org/wiki/Everything_is_a_file) +is a convenient shorthand for the more accurate +but less catchy notion that (almost) all +resources available to a process on a +[Unix-like](https://en.wikipedia.org/wiki/Unix-like) +operating system can be referenced by a +[file descriptor](https://en.wikipedia.org/wiki/File_descriptor). 
diff --git a/_articles/git_basics.md b/_articles/git_basics.md new file mode 100644 index 0000000..7fbdd70 --- /dev/null +++ b/_articles/git_basics.md @@ -0,0 +1,37 @@ +--- +layout: article +title: Introduction to Git +--- +* We used [this](https://kdlp.underground.software/course/slides/git.html) slide deck + +* Git is distributed [version control](https://en.wikipedia.org/wiki/Version_control) software + +* [Git](https://git-scm.com/) is not [GitHub](https://github.com) + + * GitHub is one implementation of an interface for git + + * There are variously featured alternatives, such as [GitLab](https://gitlab.com/), [Bitbucket](https://bitbucket.org/), and [cgit](https://git.zx2c4.com/cgit/) + + * The KDLP team maintains a custom-themed cgit instance [here](https://kdlp.underground.software/cgit) + +* Git is built on a [tree-like data structure](https://en.wikipedia.org/wiki/Tree_(data_structure)) that contains the entire change history of a project + +* **Git proficiency is one of the most useful and valuable software engineering skills a computer science student can learn in preparation to enter the industry** + +* Charlie did a demo in the terminal. 
Here's a rough outline of the various git commands he covered: + + * `git clone`: Cloning the [ILKD_assignments](https://kdlp.underground.software/cgit/ILKD_assignments/) repository + + * `git commit`: Committing new local changes to the repository + + * `git merge`: Combining two change histories into one + + * `git reset`: Undoing previous changes, and going nuclear with `--hard` + + * `git rebase`: Rewriting the git history + + * (single commit rewrite cases can be handled with `git commit --amend`) + + * When things don't go right, you may have to resolve merge conflicts by manually editing source files and re-committing + + * This should not be something you have to do for this course; however, for anyone who is interested, here is an article on [merge conflicts](https://css-tricks.com/merge-conflicts-what-they-are-and-how-to-deal-with-them/) diff --git a/_articles/kernel_basics.md b/_articles/kernel_basics.md new file mode 100644 index 0000000..609ac85 --- /dev/null +++ b/_articles/kernel_basics.md @@ -0,0 +1,153 @@ +--- +layout: article +title: Linux Kernel Basics +--- +#### Fundamental difference: CPU privilege level at a given time + +* On a Linux system, when the CPU is executing code in a fully privileged mode, we say that the CPU is executing the code in kernelspace + +* On a Linux system, when the CPU is executing code at a restricted privilege level, we say that the CPU is executing the code in userspace + +#### What does CPU privilege enable? +* Set behavior on [trap](https://en.wikipedia.org/wiki/Interrupt#Terminology) +* Set behavior on [interrupt](https://en.wikipedia.org/wiki/Interrupt) + * A quick explanation of [both](https://stackoverflow.com/questions/3149175/what-is-the-difference-between-trap-and-interrupt) +* Set virtual-physical address mappings, i.e. 
configure [page tables](https://en.wikipedia.org/wiki/Page_table) +* Execute privileged instructions + * See [this table](https://www.felixcloutier.com/x86/) as an x86 instruction reference + * We discussed [RDMSR](https://www.felixcloutier.com/x86/rdmsr) as an example + * Unprivileged execution of this instruction [raises](https://www.felixcloutier.com/x86/rdmsr#protected-mode-exceptions) the `#GP(0)` [CPU exception](https://wiki.osdev.org/Exceptions). + * This is a [General Protection Exception](https://wiki.osdev.org/Exceptions#General_Protection_Fault) + * There are three types of CPU exceptions: [faults](https://wiki.osdev.org/Exceptions#Faults), [traps](https://wiki.osdev.org/Exceptions#Traps), and [aborts](https://wiki.osdev.org/Exceptions#Aborts). + * The website linked in the previous bullet, wiki.osdev.org, has a huge amount of information on this subject for the interested reader +* We experimented with a bit of assembly code to test out the exception handling codepaths of the Linux kernel. + +#### Exceptions review + +We've previously demonstrated how our attempts +to execute an invalid instruction or a privileged +instruction while in user mode cause a CPU +exception. At the hardware level, the CPU +immediately switches its privilege mode to kernel mode +and jumps to the corresponding kernel function installed +at boot to handle the exception. In this case, +the handler function prints an error message to the +kernel ring buffer and kills our program. + + +To give a couple more examples: +Software conditions such as dividing by zero or accessing an unaligned address trigger CPU exceptions. Hardware can of course also interrupt CPU execution by changing voltage on CPU pins. 
Finally, attempting to access a pointer +([virtual memory address](https://en.wikipedia.org/wiki/Virtual_memory)) for which the +[memory management unit](https://en.wikipedia.org/wiki/Memory_management_unit) does not have a corresponding physical address triggers a +[page fault](https://wiki.osdev.org/Exceptions#Page_Fault) +exception which the kernel may resolve by setting up an appropriate mapping (e.g. memory that was lazily allocated or swapped to disk), or otherwise by sending the program the `SIGSEGV` signal, better known as a "Segmentation Fault". + +#### Userspace Demo + +Here's a short AT&T-style x86 assembly file we can use to generate a binary that will attempt to execute a privileged instruction: + +``` +.globl _start # declare the _start symbol to have external linkage so it is visible to the linker +_start: # the true entry point for an x86 executable program + rdmsr # execute the RDMSR instruction +``` + +Build the object file `rdmsr.o` from [`rdmsr.src`](https://kdlp.underground.software/cgit/priv_rdmsr_demo/tree/rdmsr.src) with: + +`as -o rdmsr.o rdmsr.src` + +Create the linked executable binary `rdmsr` from `rdmsr.o` with: + +`ld -o rdmsr rdmsr.o`. + +Invocation of this binary by `./rdmsr` should trigger a general protection fault. + +For an invalid (rather than privileged) instruction, see more information on the [`#UD` Invalid Opcode](https://wiki.osdev.org/Exceptions#Invalid_Opcode) exception. 
+ +#### Kernelspace Demo + +With a small kernel module, we can get Linux to run the same instruction in kernelspace: + +``` +#include +#include +MODULE_LICENSE("GPL"); +static int priv_demo_init(void) { + /* arbitrary poison values */ + int result_lower_32 = -0xAF, result_upper_32 = -0xBF; + pr_info("EDX:EAX := MSR[ECX];\n"); + /* rdmsr writes the MSR selected by ECX into EDX (upper half) and EAX (lower half) */ + asm ( "rdmsr" + : "=d" (result_upper_32), "=a" (result_lower_32) : : ); + pr_info("rdmsr: EDX=0x%x, EAX=0x%x\n", + result_upper_32, result_lower_32); + return 0; +} +static void priv_demo_exit(void) { + pr_info("rdmsr exiting\n"); +} +module_init(priv_demo_init); +module_exit(priv_demo_exit); +``` + +We can build this with the same Makefile as shown [here on the E2 page](https://kdlp.underground.software/course/fall2023/assignments/E2.md). + +#### Fully Automated demo + +We created a fully automated demo of privileged and unprivileged instruction execution. +To acquire and run this demo, enter your VM, run `git clone https://kdlp.underground.software/cgit/priv_rdmsr_demo/`, and then run `make` inside the directory. + +### Further look at kernelspace vs userspace demo + +We took another look at the demo we posted on [L05](L05.md) after class. + +That demo can be found +[here](https://kdlp.underground.software/cgit/priv_rdmsr_demo/) and obtained by running: + + git clone https://kdlp.underground.software/cgit/priv_rdmsr_demo + +Ensure that you are comfortable with some of the introductory details +we discussed in [L05](L05.md). + +Recall from [L05](L05.md) that a trap is a type of CPU exception. + +We browsed the source for the Linux implementation of trap handling to understand the codepath that executes when the user executes the "UD2" instruction and prints a message to the kernel ring buffer (`dmesg`). 
+ +The address of the handler for this exception is defined in +[arch/x86/kernel/traps.c](https://elixir.bootlin.com/linux/v6.5.5/source/arch/x86/kernel/traps.c), as +[`exc_invalid_op`](https://elixir.bootlin.com/linux/v6.5.5/source/arch/x86/kernel/traps.c#L336). +Elsewhere, the corresponding row of the +[IDT](https://wiki.osdev.org/IDT) +is set to this address, so when the exception is generated, the handler runs and +[`handle_invalid_op`](https://elixir.bootlin.com/linux/v6.5.5/source/arch/x86/kernel/traps.c#L292) is called. + +If you are interested in the IDT then you may also be interested in the +[GDT](https://wiki.osdev.org/GDT). + +Linux implements a lot of x86-specific IDT related code in +[arch/x86/kernel/idt.c](https://elixir.bootlin.com/linux/v6.5.5/source/arch/x86/kernel/idt.c). + +### Intro to kernelspace + +To begin, we used parts of the [Kernel Modules and Device Drivers](https://kdlp.underground.software/course/slides/modules_drivers.html) slide deck. + +* The slides are a little bit out of sync with how we have re-arranged the course and we have not yet reached device driver development. + +* The last three slides were the most relevant; however, students may be interested in taking a look at the rest of it. + +* The kernel uses a small, fixed-size stack, compared to the larger, extendable stack used by userspace programs. + +* The C library, itself being userspace code, is not available in kernelspace. Instead, many of its functions -- but importantly not all -- are implemented within the kernel. + +* For example, the IEEE754 floating point storage type that we all know and love from userspace C programming is entirely banned from the kernel. + +* The reason is that when the CPU switches between kernelspace and userspace, it has to save and restore its execution state to remember where it left off, and saving and restoring the floating point registers is considered to be too much overhead. + +* The kernel uses a different range of the address space than userspace. 
On x86_64 systems, the virtual address space is generally split in half + +#### The most important takeaway: kernel code is **reentrant** + +* Definition: A computer program is considered **reentrant** if and only if multiple concurrent executions of the same program always run correctly. + +* Further information can be found on the [reentrancy](https://en.wikipedia.org/wiki/Reentrancy_(computing)) Wikipedia page. + +* Assume that *any* line of code in the kernel can be running at *any* time with *any* number of concurrent executions of the same code. diff --git a/_articles/linux_basics.md b/_articles/linux_basics.md new file mode 100644 index 0000000..58d98af --- /dev/null +++ b/_articles/linux_basics.md @@ -0,0 +1,43 @@ +--- +layout: article +title: Linux System Basics +--- +### A time saver for vim users + +Append these two lines to your `~/.vimrc` to auto-highlight whitespace errors in your editor: + +``` +:highlight ExtraWhitespace ctermbg=red guibg=red +:match ExtraWhitespace /\s\+$/ +``` + +### Linux crash course and minimal distro + +* We started with the [Linux Crash Course](https://kdlp.underground.software/course/slides/linux_crash_course.html) slide deck. +* First, we gave the general what, where, why, and who of Linux + * Free and open source operating system kernel and ecosystem + * Running on many systems large and small, fast and slow, distributed and centralized + * Many pre-packaged combinations of system components capable of running user applications are available + * We call one of these a Linux distribution, or distro for short. 
+ * Some examples are Fedora, Ubuntu, Arch Linux, RHEL, Puppy, and TAILS +* Having motivated our discussion of system components, we proceeded to discuss each one briefly + * We quickly went through the bootloader, kernel, C standard library, shared libraries, storage configuration, and filesystem hierarchy layout + * As we were about to proceed to the demo, we didn't go into great detail + * Take note of the availability of manpages: run `man man` and read through the description. +* After finishing the slide deck, we built a minimal Linux distribution live + * Using a ruthlessly minimal `.config` file, we built the kernel in noticeably less time than it took to clone, even with the `--depth=1` option. Barely minutes. + * Then, we tested this kernel with `qemu-system`, generating a kernel panic as expected + * This is because the kernel attempts to start an initial userspace process and no such thing exists on the system + * On many Linux systems, systemd is this first "init" process, and it is given PID `1` + * We then created a root filesystem and a "hello world" init process + * This native assembly simply prints hello world + * Take note that the entry point is `_start` and not `main` + * Normally this is hidden as `libc` does various init things between `_start` and the user-defined `main` + * We used `strace` to see how our little app talks to the kernel + * In order to pass the app binary into the kernel, we use `cpio` to create an archive usable as an initial ram disk for our system, which allows the kernel to run the app as an init process + * This is generally referred to by the abbreviation `initrd`. + * `initrd` is an area of RAM that provides a storage device interface to the rest of the kernel. 
+* Since there were no questions, we attempted to make our minimal Linux distro interactive + * We implemented a parameter allowing a user to specify a string to print after "hello" instead of "world" + * The feature started working just in time for the end of class + diff --git a/_articles/module_translation.md b/_articles/module_translation.md new file mode 100644 index 0000000..f12fc9e --- /dev/null +++ b/_articles/module_translation.md @@ -0,0 +1,1369 @@ +--- +layout: article +title: Translating a User Program Into Kernel Code +--- +We begin with a simple but non-trivial user program: + +```c +$ cat user_code.c +#include <err.h> +#include <errno.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> + +struct example +{ + char *message; + size_t size; +}; + +static struct example *example_create(const char *msg) +{ + struct example *ex = malloc(sizeof *ex); + if(!ex) + goto out; + ex->size = strlen(msg); + ex->message = strdup(msg); + if(!ex->message) + goto out_free; + return ex; +out_free: + free(ex); + ex = NULL; +out: + return ex; +} + +static void example_destroy(struct example *ex) +{ + free(ex->message); + free(ex); +} + +static bool example_update_message(struct example *ex, const char *msg) +{ + size_t size = strlen(msg); + char *data = strdup(msg); + if(!data) + return false; + free(ex->message); + ex->message = data; + ex->size = size; + return true; +} + +static char *example_get_message(struct example *ex) +{ + return ex->message; +} + +int main(void) +{ + struct example *ex = example_create("hello"); + if(!ex) + err(1, "unable to allocate memory"); + printf("%s\n", example_get_message(ex)); + if(!example_update_message(ex, "goodbye")) { + int temperrno = errno; + example_destroy(ex); + errno = temperrno; + err(1, "unable to update"); + } + printf("%s\n", example_get_message(ex)); + example_destroy(ex); + return 0; +} + +``` + +Before we proceed, let's note a few key features. 
+ +**Data flow** + +The program works with structured data, primarily in the form of `struct example`: + +```c +struct example +{ + char *message; + size_t size; +}; +``` + +This pair of elements represents a simple byte string and its size. +Take note that both the data structure itself +and the memory located at `message` +can be allocated either statically or dynamically, +and we take care to ensure that these two layers are handled appropriately. + +Our typical userspace entry point, +the `main` function, +declares a pointer to one of these `struct example` types +and then immediately assigns the return value of a constructor-style +function `example_create()`, +whose job is to encapsulate the finer details of allocation and initialization. + +In good style, `main` is responsible for cleaning up its own mess, +and this task is executed right before `main` returns back to the C library +at the bottom of the function by invocation of `example_destroy()`. +When implementing a more complex program, we may pass a pointer to our +local reference in order to zero the value to avoid subsequent misuse by the caller, +i.e. a dangling pointer, however this is unnecessary complexity for this simple example +and it suffices to simply ensure that our program does not leak memory. +Usage of the userspace tool `valgrind` will validate this property of our program, +but we do admit for a short-lived program such as this example whose memory is cleaned up +by the kernel at termination, the fuss and rigor around memory leaks appears pedantic beyond +the practice of good habits. Though practice is reason enough, +we will soon find ourselves in kernelspace where there is no one to clean up after us. +In the kernel, a memory leak will persist until reboot and in the meantime will clog the tubes of the +[memory allocator](https://lwn.net/Articles/229984/). 
+ + +**Control Flow** + +Our example program implements a control flow that should +not raise the eyebrows of a C programmer with beyond novice-level skill. +We don't do anything fancy with the entry point, +and we don't create any threads. +We invoke a constructor to allocate our memory in fairly standard form, +using the old reliable `malloc` function from the `<stdlib.h>` section +of the trusty C library. During instantiation, we make a couple of calls +to the `<string.h>` section in the form of `strlen()` and `strdup()`, +both of which assume as a precondition a nicely null-terminated input string +as the `msg` parameter. Likewise, we perform the same operations +in `example_update_message()`, assuming the same precondition. + +Each call to `malloc()` pairs with a corresponding call to `free()`, +both at the level of the allocated message and the data structure itself, +and in just the same pattern our `example_create()` constructor function pairs +with our `example_destroy()` destructor function. +The `example_get_message()` function implements a getter and `example_update_message()` +implements a setter. The complexity of the latter is due to the need to duplicate the +byte-string `msg` argument and free the now-junk memory residing at the +address contained in `ex->message`. + +**Error Flow** + +A careful reader of our example may take alarm at a particular feature. +We too have heard these rumors, that the C `goto` statement is considered +["harmful"](https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.pdf). +Despite these tall tales, we inform you with confidence that while there +are many paths to correct code, the +[shortest path](https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm) +to readability and maintainability is often by use of this fearsome little keyword. +For one, correct usage of `goto` and error case labeling as seen in our example +eliminates the need for repetitive code and unnecessary indentation. 
+As it is +[written](https://www.kernel.org/doc/Documentation/process/coding-style.rst): +"if you need more than 3 levels of indentation, you're screwed anyway and you should fix your program". +We will not elaborate any +[further](https://www.cs.utexas.edu/users/EWD/transcriptions/EWD10xx/EWD1009.html). + +Next, note our usage of `err()` from `<err.h>`. +This handy tool lets us perform the work of `perror()` and `exit()` +with a single invocation. +We first pass the return code we would have handed to `exit()` +and then we specify the string snatched from the jaws of `perror()`. + +The final point worth noting is our usage of `temperrno`. +Think of this as if we were "pushing" the value of `errno` +at that instant onto the stack, +like we would do at the assembly level for a register +before a jump or call into a section of code that may clobber said register. +Usage of the C library function `free()` +in our call to `example_destroy()` may overwrite the previous value of `errno`, +but the previous value is the relevant one to report in the context +of cleaning up after a failed call to `example_update_message()`. + + +**System Flow** + +The program sends the following text to `stdout` when run: + +``` +$ ./user_example +hello +goodbye +``` + +Other than this, the program does not interact with the system +in any manner worth noting. + +#### Leaving Kansas + +Now that we have analyzed the user code with +excruciatingly thorough exposition, +let us turn to the primary task at hand. 
+ +In order to satisfy what we assume to be our reader's +ravenous appetite for kernel module code and alleviate +the all-too-familiar pangs of hunger for privileged execution, +we'll begin by dropping the complete `diff -up` output between +the above program and its kernel equivalent: + +```diff +$ diff -Naup user_code.c kernel_code.c +--- user_code.c 2023-11-07 23:30:25.792075105 -0500 ++++ kernel_code.c 2023-11-07 23:30:16.628563819 -0500 +@@ -1,9 +1,6 @@ +-#include <err.h> +-#include <errno.h> +-#include <stdbool.h> +-#include <stdio.h> +-#include <stdlib.h> +-#include <string.h> ++#include <linux/module.h> ++#include <linux/string.h> ++#include <linux/slab.h> + + struct example + { +@@ -13,16 +10,16 @@ struct example + + static struct example *example_create(const char *msg) + { +- struct example *ex = malloc(sizeof *ex); ++ struct example *ex = kmalloc(sizeof *ex, GFP_KERNEL); + if(!ex) + goto out; + ex->size = strlen(msg); +- ex->message = strdup(msg); ++ ex->message = kstrdup(msg, GFP_KERNEL); + if(!ex->message) + goto out_free; + return ex; + out_free: +- free(ex); ++ kfree(ex); + ex = NULL; + out: + return ex; +@@ -30,17 +27,17 @@ out: + + static void example_destroy(struct example *ex) + { +- free(ex->message); +- free(ex); ++ kfree(ex->message); ++ kfree(ex); + } + + static bool example_update_message(struct example *ex, const char *msg) + { + size_t size = strlen(msg); +- char *data = strdup(msg); ++ char *data = kstrdup(msg, GFP_KERNEL); + if(!data) + return false; +- free(ex->message); ++ kfree(ex->message); + ex->message = data; + ex->size = size; + return true; +@@ -51,20 +48,39 @@ static char *example_get_message(struct + return ex->message; + } + +-int main(void) ++int example_init(void) + { ++ int ret = -ENOMEM; ++ const char *msg; + struct example *ex = example_create("hello"); ++ msg = KERN_ERR "unable to allocate memory"; + if(!ex) +- err(1, "unable to allocate memory"); +- printf("%s\n", example_get_message(ex)); +- if(!example_update_message(ex, "goodbye")) { +- int temperrno = errno; +- example_destroy(ex); +- errno = temperrno; +- err(1, "unable 
to update"); +- } +- printf("%s\n", example_get_message(ex)); ++ goto out; ++ ++ pr_info("%s\n", example_get_message(ex)); ++ ++ msg = KERN_ERR "unable to update\n"; ++ if(!example_update_message(ex, "goodbye")) ++ goto out_free; ++ ++ pr_info("%s\n", example_get_message(ex)); ++ ++ ret = 0; ++ msg = NULL; ++out_free: + example_destroy(ex); +- return 0; ++out: ++ if(msg) ++ printk(msg); ++ return ret; ++} ++ ++void example_exit(void) ++{ + } + ++module_init(example_init); ++module_exit(example_exit); ++ ++MODULE_LICENSE("GPL"); ++ + +``` + +The length of this `diff` output exceeds the length of the original user program. +We will proceed with an explanation of each change. + +#### Welcome to Oz + +The transition to writing kernel code is a shift to another plane of reality. +Previous assumptions about what a C program looks like may no longer hold, +and the reader may encounter strange-looking constructs and ludicrously +deep layers of macro invocations, generating the sense of a fever dream. +When all appears to be lost, bear in mind one key point: +There is no escape from the kernel. +The kernel has been running since the CPU exited the bootloader +and only a semi-magical illusion has hidden this raw truth from your eyes. +Today, we lift this curse from the reader, +revealing, as the scales fall from their eyes, +the vibrant glory of kernel module code, +and forever dispelling the last remnant +of prestidigitation from their mental model of the computer. +Magic no more! The entirety of the machine, +software and hardware stack united as one, +lies bare before the attentive reader, +and nothing, save polynomial time factoring of large numbers, +remains beyond reach. + +Well then, let's get started. 
+ +**Switch to kernel headers** + +First off, the C standard library is not available +within the kernel, so we discard the inclusion +of the header files that provide C library +declarations: + +``` +-#include +-#include +-#include +-#include +-#include +-#include ++#include <linux/module.h> ++#include <linux/string.h> ++#include <linux/slab.h> +``` + +Instead, we include headers declaring +Linux kernel API entry points. +These paths are relative to the `include` +directory within the kernel repository. + +The first, +[`<linux/module.h>`](https://elixir.bootlin.com/linux/latest/source/include/linux/module.h), +provides the basic building blocks for a kernel module, +such as `#define`s of the `module_init()` and `module_exit()` macros we encounter later on. +Importantly, this file also `#define`s the mandatory `MODULE_LICENSE()` macro, +which we will return to at the end, and, through its own includes, makes `printk()` and the associated macros available. + +Next, we include +[`<linux/string.h>`](https://elixir.bootlin.com/linux/latest/source/include/linux/string.h) +to replace some of the functionality we accessed via the C library's `string.h`. +Some of the functions retain their familiar names, like `strlen()`, while others +like `kstrdup()` take on new names and new arguments. + +Finally, in order to allocate and free memory, we include +[`<linux/slab.h>`](https://elixir.bootlin.com/linux/latest/source/include/linux/slab.h), +which gives us the duo of `kmalloc()` and `kfree()`, +second cousins of the familiar userspace versions. + +That's all for the `#include`s. +Here we can briefly note that the `struct example` we defined in userspace +is perfectly suitable for usage in kernelspace, so we skip right over it. + +**Memory allocation with a twist** + +``` + struct example + { +@@ -13,16 +10,16 @@ struct example + + static struct example *example_create(const char *msg) + { +``` + +Now, we arrive at our first usage of `kmalloc()`. +Like userspace `malloc()`, +this function takes a number of bytes to allocate +as its first argument, but `kmalloc()` takes a mysterious +second argument. 
In fact, this is the same argument +passed as the mysterious second argument to `kstrdup()`. +Luckily for the simplicity of this paragraph, `kfree()` +works exactly like `free()`. + +``` +- struct example *ex = malloc(sizeof *ex); ++ struct example *ex = kmalloc(sizeof *ex, GFP_KERNEL); + if(!ex) + goto out; + ex->size = strlen(msg); +- ex->message = strdup(msg); ++ ex->message = kstrdup(msg, GFP_KERNEL); + if(!ex->message) + goto out_free; + return ex; + out_free: +- free(ex); ++ kfree(ex); + ex = NULL; + out: + return ex; +@@ -30,17 +27,17 @@ out: + + static void example_destroy(struct example *ex) + { +- free(ex->message); +- free(ex); ++ kfree(ex->message); ++ kfree(ex); + } + + static bool example_update_message(struct example *ex, const char *msg) + { + size_t size = strlen(msg); +- char *data = strdup(msg); ++ char *data = kstrdup(msg, GFP_KERNEL); + if(!data) + return false; +- free(ex->message); ++ kfree(ex->message); + ex->message = data; + ex->size = size; + return true; +@@ -51,20 +48,39 @@ static char *example_get_message(struct + return ex->message; + } +``` + +The changes to the three functions `example_create()`, +`example_destroy()`, and `example_update_message()` are +all limited to these three substitutions, two of which introduce +this mysterious second `GFP_KERNEL` argument. +We will pause here to discuss this in more depth before getting into +the real funky stuff. + +We find the declaration of `kmalloc()` in the latter half of +[`include/linux/slab.h`](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L590) +and the included comment provides us with far more articulate explication than we could muster. 
+ +We include a snippet of the +[Linux `v6.6` kmalloc comment](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L548) +verbatim: + +``` + * The @flags argument may be one of the GFP flags defined at + * include/linux/gfp_types.h and described at + * :ref:`Documentation/core-api/mm-api.rst ` + * + * The recommended usage of the @flags is described at + * :ref:`Documentation/core-api/memory-allocation.rst ` + * + * Below is a brief outline of the most useful GFP flags + * + * %GFP_KERNEL + * Allocate normal kernel ram. May sleep. + * + * %GFP_NOWAIT + * Allocation will not sleep. + * + * %GFP_ATOMIC + * Allocation will not sleep. May use emergency pools. + * + * Also it is possible to set different flags by OR'ing + * in one or more of the following additional @flags: + * + * %__GFP_ZERO + * Zero the allocated memory before returning. Also see kzalloc(). + * + * %__GFP_HIGH + * This allocation has high priority and may use emergency pools. + * + * %__GFP_NOFAIL + * Indicate that this allocation is in no way allowed to fail + * (think twice before using). + * + * %__GFP_NORETRY + * If memory is not immediately available, + * then give up at once. + * + * %__GFP_NOWARN + * If allocation fails, don't issue any warnings. + * + * %__GFP_RETRY_MAYFAIL + * Try really hard to succeed the allocation but fail + * eventually. +``` + + +The curious reader should feel free to pursue any rabbit hole +referenced within that comment. + +The signature of `kmalloc()` itself is quite simple when the funny business is hidden: + + void *kmalloc(size_t size, gfp_t flags) + +The second argument is a `typedef`ed wrapper for what is really nothing more than a fancy +[`unsigned int`](https://elixir.bootlin.com/linux/v6.6/source/include/linux/types.h#L154), +but, in good style, +these implementation details are hidden from us +unless we search for them. +Essentially, this second `flags` argument is used to specify additional options +to the memory allocator. 
+One could easily implement such a compact bit-flags argument in userspace, +and certainly many of our readers have done so, +but we understand the confusion a novice kernel programmer may encounter +when forced to select options from a menu of foreign-language items in order +to perform a task as apparently simple as memory allocation. + +Let us back up a couple of steps and motivate this complexity. +As we noted in our discussion of the userspace program in the "Data Flow" section, +there is no other process within the system that will come to save the kernel. +Without expanding the scope of our analysis +[beyond a single system](https://en.wikipedia.org/wiki/Virtual_machine) +or into the realm of +[exotic hardware](https://en.wikipedia.org/wiki/Intel_Active_Management_Technology), +we must operate under the knowledge that +the kernel is the sovereign and absolute monarch of a computer system +from the time that the bootloader kindly requests that the CPU jump into the kernel code +to the time the computer is either reset or physically destroyed. +While this absolute authority grants the CPU the enjoyment of maximally privileged execution, +this absolute responsibility yokes the CPU with the burden of maximally privileged execution. + +When we write kernel code, in this case a kernel module that allocates and frees memory, +we can't just blindly type up some half-baked garbage willy-nilly +and grind out a compile/valgrind/debug loop until all the errors are ironed out. +Certainly +[there are tools](https://docs.kernel.org/dev-tools/kmemleak.html) +for searching the kernel for memory leaks, +but the instrumentation of the kernel is not nearly as trivial +as the runtime instrumentation performed by `valgrind`. +To zoom into our particular context, take a closer look at the three `GFP_*` flags +in the `kmalloc()` comment which are not prefixed by a double underscore ("dunder"): + +``` + * %GFP_KERNEL + * Allocate normal kernel ram. May sleep. 
+ * + * %GFP_NOWAIT + * Allocation will not sleep. + * + * %GFP_ATOMIC + * Allocation will not sleep. May use emergency pools. +``` + +We briefly note a +[(non-standards compliant)](https://stackoverflow.com/questions/73542215/leading-underscores-in-linux-kernel-programming) +design choice: +identifiers that begin with an underscore +are more "internal" than those without one, +and those with two are extra internal. +While "internal" is doing a lot of heavy lifting +in that sentence, the context of each usage clarifies the details. +A less "internal" API function may be +[exported as a symbol](https://docs.kernel.org/core-api/symbol-namespaces.html) +to the rest of the kernel, +while a more "internal" identifier may provide an entry point +to a kernel function that skips certain locking steps, +or, in the case of +[`_copy_from_user()`](https://elixir.bootlin.com/linux/v6.6/source/include/linux/uaccess.h#L143), +[permissions and protection checks](https://unix.stackexchange.com/questions/674962/why-are-copy-from-user-and-copy-to-user-needed-when-the-kernel-is-mappe). +In the case of the `GFP_*` flags above, +the dunder versions are declared as such +to hint to kernel engineers that these flags +are generally not used directly like the non-dunder versions. + +As can be validated by a `ctrl+f`, +our kernel module uses `GFP_KERNEL`. +This is because we are running in the context of +a user process and therefore it's ok if the +codepath of the allocation includes a sleep or two +before returning to the caller. +We may even schedule out and switch processes multiple times +before the allocation spits out the needed valid memory address. +However, the CPU may be executing code in a context +where sleep is not only undesirable, +but theoretically terminal for the entire system. +One example of such a context is within the +[top-half or bottom-half](https://static.lwn.net/images/pdf/LDD3/ch10.pdf#page=18) +of an interrupt handler. 
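
To make the flags mechanism concrete before moving on, here is a minimal userspace sketch of an allocator that takes a compact bit-flags argument. Everything in it is hypothetical: `my_gfp_t`, the `MY_GFP_*` values, and `my_alloc()` are names invented for illustration, and the bit values bear no relation to the kernel's real `GFP_*` flags or to how the kernel allocator behaves.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace analogue of gfp_t: a plain unsigned int. */
typedef unsigned int my_gfp_t;

/* Hypothetical option bits; the values are arbitrary, not the kernel's. */
#define MY_GFP_PLAIN  0x0u /* no special behavior */
#define MY_GFP_ZERO   0x1u /* zero the memory before returning it */
#define MY_GFP_NOFAIL 0x2u /* abort instead of ever returning NULL */

/* Allocate size bytes, honoring whichever option bits are set in flags. */
static void *my_alloc(size_t size, my_gfp_t flags)
{
    void *p;

    /* malloc(0) is implementation-defined, so we refuse it outright. */
    assert(size > 0);

    p = malloc(size);
    if (!p) {
        if (flags & MY_GFP_NOFAIL)
            abort();
        return NULL;
    }
    if (flags & MY_GFP_ZERO)
        memset(p, 0, size);
    return p;
}
```

A caller combines options with bitwise OR, as in `my_alloc(16, MY_GFP_ZERO | MY_GFP_NOFAIL)`, which is the same calling shape as `kmalloc(size, GFP_KERNEL | __GFP_ZERO)`.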
+ +The crucial topic of kernel context +deserves its own thorough treatment, +so we will only briefly touch upon it here. +The essential difference for our purpose +is that kernel code can sleep in process context, +while it cannot sleep in atomic or interrupt context. +In process context, we have a process associated with +the running kernel thread, though the immediate business +of the kernel may not be directly relevant to that particular +userspace process. +These kernel threads can copy data to or from userspace memory, +send signals to the current process, +and generally muck around with the +[`struct task_struct`](https://elixir.bootlin.com/linux/v6.6/source/include/linux/sched.h#L743) +found by dereferencing the address the `current` macro resolves to. +On the other hand, +a kernel thread running in interrupt or atomic context +is not associated with any userspace process. +Though `current` will point to the process whose execution +this kernel thread is interrupting, +this thread must accomplish its business as soon as possible. +It cannot sleep at all, +so any memory allocation must return immediately. +The `GFP_NOWAIT` flag requests this behavior with less urgency, +whereas the `GFP_ATOMIC` flag marks the allocation request with +a huge, red, bold exclamation mark attached, +and requests to be fed with the emergency reserves in the case of low memory. +This is sane, as we would like something like our keyboard to be able to +send interrupts that are immediately received and processed, +even when the bloated closed-source +software we run by choice or by force decides to consume all of our system resources. + +tl;dr just use `GFP_KERNEL` unless you have a good reason not to. + + +At last, we move on to the changes to our classical userspace entry point. + +**Entry to the other side** + +``` +-int main(void) ++int example_init(void) +``` + +This change simply renames `main` to `example_init`. 
+Do not take this for any sort of magic +as this is nothing but a naming convention +whose purpose will be discussed near the bottom +of this diff analysis. +We could just as well call our module initialization function `main`, +but this would be confusing. +The demotion of this function +from the known entry point styled `main` +sets our footing loose from +that familiar foundation +of the userspace coding environment, +and we will return to this concern +near the bottom of this diff analysis. + +``` ++ int ret = -ENOMEM; +``` + +While the classic idiom of a +print to standard error and +nonzero-argument invocation of the exit syscall +consolidated with the `err()` API call suited our needs +quite satisfactorily back in Kansas, +we will find this exit strategy +falls flat on its face here in Oz. +To begin with, this exit strategy +relies on the invocation of a system call, +that is to say, +an explicit invocation of the kernel by userspace code, +and more specifically, a request for the kernel +to terminate the calling process. +As we are already executing in kernel mode, +there is no need to invoke ourselves, +and we certainly don't wish to commit suicide +on behalf of anyone in the failure case, +least of all on behalf of the kernel itself. +Instead, as the userspace integral file descriptor +is to the kernelspace `struct file`, +the thread-local userspace integral errno variable is +to the kernelspace negative integral errno value. +Though the specific reason for the convention of negativity +is unimportant and perhaps +[unknowable](https://stackoverflow.com/questions/1848729/why-return-a-negative-errno-e-g-return-eio), +one should take note of the convention itself. +We default to the negated out-of-memory errno value of +[-ENOMEM](https://elixir.bootlin.com/linux/v6.6/source/include/uapi/asm-generic/errno-base.h#L16) +as the return code for our function since +that is the only error we check for. 
+Once we confirm that we are in fact able +to allocate the necessary memory, +we set this value to zero. +One may frequently see code +that defaults the value of the return code to zero. +A careful treatment of that flamewar +is beyond the scope of this section. + +When one of these errno return values is propagated +all the way back to userspace in the context +of a system call, +the userspace caller will then be able to +access this value via the thread-local errno variable. + +Keep in mind that a thread-local variable in userspace +corresponds to a per-task variable from the perspective of kernelspace. +A process ID in kernelspace, known as a `pid`, corresponds one-to-one +with a userspace thread ID, known as a `tid`. +Confusingly, userspace uses the same term "process ID" or `pid` +in its more common sense to identify a process, +which contains one or more threads, each identified by +a unique thread ID, or `tid`. +When a userspace process contains but a single thread, +the `pid` and the `tid` are the same, +and the kernelspace `pid` refers to the `struct task_struct` +representing the single userspace thread. +For a multi-threaded userspace process, +a userspace `pid` is associated with multiple `tid` values, +and each of these userspace `tid` values corresponds +one-to-one with a kernelspace `pid` value and a representative +`struct task_struct` as the Linux implementation of +the more general concept of a +[Process control block](https://en.wikipedia.org/wiki/Process_control_block). +These threads are grouped together logically, +and so as one might expect, the kernel refers to +the collection of kernelspace `pid` values grouped +under a single userspace `pid` value as userspace `tid`s +by the term "Thread-group ID", abbreviated as `tgid`. 
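
The naming mismatch described above can be observed from userspace. The following is a minimal sketch that assumes a Linux system where `SYS_gettid` is available; the helper names `userspace_tid()`, `userspace_pid()`, and `is_thread_group_leader()` are hypothetical and exist only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* The userspace "tid", which is what the kernel itself calls a pid. */
static pid_t userspace_tid(void)
{
    long tid = syscall(SYS_gettid);

    /* gettid cannot fail, so a non-positive value means a broken setup. */
    assert(tid > 0);
    return (pid_t)tid;
}

/* The userspace "pid", which is what the kernel calls the tgid. */
static pid_t userspace_pid(void)
{
    return getpid();
}

/* Nonzero when the calling thread's tid equals its process's pid. */
static int is_thread_group_leader(void)
{
    return userspace_tid() == userspace_pid();
}
```

Called from the initial thread of a process, `is_thread_group_leader()` returns nonzero because the two identifiers coincide. Called from any additional thread of that process, `userspace_pid()` would still report the same value while `userspace_tid()` would differ.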
+ +To summarize: + +|Concept|Userspace name|Kernelspace name| +|--|--|--| +|Single thread|`tid`|`pid`| +|Logical Process|`pid`|`tgid`| + +**Buffering with style** + +```c ++ const char *msg; + struct example *ex = example_create("hello"); ++ msg = KERN_ERR "unable to allocate memory"; +``` + +Though this construct appears strange at first glance, +we will quickly demystify this last assignment +with a quick exposition of C string syntax. +[Section 6.4.5](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=85) +of the C standard defines +the syntax of a string literal. +As a C-literate reader should expect, +a string literal is defined to be a series of characters +from a slight restriction of the character set called "s-chars" +in between terminating double quote characters. +Optionally, the string may be prefixed by what the standard terms an "encoding prefix" +but the details of that are not important here. +To quote the 1 April 2023 working draft, an "s-char" is: +"any member of the source character set except +the double-quote ", backslash \, or new-line character". +[Section 5.1.1.2](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=29) +specifies the order of precedence for translation stages during compilation, +and we see that item 6 clearly states that: +"Adjacent string literal tokens are concatenated." +Therefore, by process of elimination and +before even looking up the definition of `KERN_ERR`, +we know that `KERN_ERR` must be a string literal +because this code compiles and we have no other option. +That covers the syntactic mystery, +but it does not explain the semantics of this statement. + +Allow us one more quick detour that will be necessary just below. +[Section 6.4.4.4](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=82) +of the standard specifies various character constants, +including the encoding prefixes we mention just above. 
+We see that an "octal-escape-sequence" is a valid "escape-sequence", +and that it is specified with one, two, or three octal digits following a backslash. +The "octal-escape-sequence" is the only one +which is implemented with no character between the backslash +and the value itself. +For example, one begins a hexadecimal escape sequence using "\x", +and a +[universal character name](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=75) +using a "u" or "U". + + +Let us turn to the +[definition](https://elixir.bootlin.com/linux/v6.6/source/include/linux/kern_levels.h#L11) +of this symbol in the `kern_levels.h` header, +whose brief 39 lines we will include in their entirety from the v6.6 source: + + +```c +$ cat include/linux/kern_levels.h +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __KERN_LEVELS_H__ +#define __KERN_LEVELS_H__ + +#define KERN_SOH "\001" /* ASCII Start Of Header */ +#define KERN_SOH_ASCII '\001' + +#define KERN_EMERG KERN_SOH "0" /* system is unusable */ +#define KERN_ALERT KERN_SOH "1" /* action must be taken immediately */ +#define KERN_CRIT KERN_SOH "2" /* critical conditions */ +#define KERN_ERR KERN_SOH "3" /* error conditions */ +#define KERN_WARNING KERN_SOH "4" /* warning conditions */ +#define KERN_NOTICE KERN_SOH "5" /* normal but significant condition */ +#define KERN_INFO KERN_SOH "6" /* informational */ +#define KERN_DEBUG KERN_SOH "7" /* debug-level messages */ + +#define KERN_DEFAULT "" /* the default kernel loglevel */ + +/* + * Annotation for a "continued" line of log printout (only done after a + * line that had no enclosing \n). Only to be used by core/arch code + * during early bootup (a continued line is not SMP-safe otherwise). 
+ */ +#define KERN_CONT KERN_SOH "c" + +/* integer equivalents of KERN_ */ +#define LOGLEVEL_SCHED -2 /* Deferred messages from sched code + * are set to this special level */ +#define LOGLEVEL_DEFAULT -1 /* default (or last) loglevel */ +#define LOGLEVEL_EMERG 0 /* system is unusable */ +#define LOGLEVEL_ALERT 1 /* action must be taken immediately */ +#define LOGLEVEL_CRIT 2 /* critical conditions */ +#define LOGLEVEL_ERR 3 /* error conditions */ +#define LOGLEVEL_WARNING 4 /* warning conditions */ +#define LOGLEVEL_NOTICE 5 /* normal but significant condition */ +#define LOGLEVEL_INFO 6 /* informational */ +#define LOGLEVEL_DEBUG 7 /* debug-level messages */ + +#endif +``` + +By examination of the above header, +we observe the resolved value of `KERN_ERR` +to be a string literal containing two bytes, +the "\001" octal escape sequence which represents +"start of heading" in the +[ASCII](https://man7.org/linux/man-pages/man7/ascii.7.html) +standard, followed by the ASCII character literal "3", +which can just as easily be represented using "\063", +however the kernel chooses to be readable. + +This may be obvious by this point, +but these bytes are used to specify +the relatively +[well-documented](https://www.kernel.org/doc/html/latest/core-api/printk-basics.html) +kernel logging level. +The usage of the `KERN_*` prefix before a string literal +is generally done within the parentheses of a `printk()` invocation, +such as the one we use later in the code, +however we assign the resulting string value to a local `char *` variable +to demonstrate what is really going on +and dispel any illusions the reader may hold. +We believe this more verbose, +multi-step usage is less likely to trigger +that part of the trained C programmer's brain +which says that there is a comma missing. 
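
The string-literal concatenation described above can be tried out in userspace. The sketch below is hypothetical throughout: `MY_SOH` and `MY_ERR` merely mirror the shape of `KERN_SOH` and `KERN_ERR`, and `my_log_level()` and `my_skip_level()` are invented helpers, not a copy of how the kernel parses the prefix.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins mirroring the shape of the kernel macros. */
#define MY_SOH "\001"     /* ASCII Start Of Heading */
#define MY_ERR MY_SOH "3" /* adjacent literals concatenate to "\001" "3" */

/*
 * Return the log level (0-7) encoded at the front of msg,
 * or -1 if msg does not begin with a level prefix.
 */
static int my_log_level(const char *msg)
{
    assert(msg != NULL);
    if (msg[0] == '\001' && msg[1] >= '0' && msg[1] <= '7')
        return msg[1] - '0';
    return -1;
}

/* Return a pointer to the message text that follows any level prefix. */
static const char *my_skip_level(const char *msg)
{
    return my_log_level(msg) >= 0 ? msg + 2 : msg;
}
```

With these definitions, `MY_ERR "unable to update\n"` is a single string whose first two bytes are the prefix, so `my_log_level()` reports 3 for it and `my_skip_level()` yields the human-readable remainder.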
+ +Though direct usage of `printk()` is acceptable, +we recommend the usage of the `pr_*` macros +described in the +[printk documentation](https://www.kernel.org/doc/html/latest/core-api/printk-basics.html), +as these helpful wrappers will prevent +one's attempted kernel build +from generating strange-looking macro-resolution errors +in the case one makes a typo. +Usage of `KERN_ERROR` is such an example. +We believe it will be easier to spot the error +when one attempts to build kernel code containing +the alternative equivalent mistaken usage of `pr_error()` +in place of the correct `pr_err()`. +In addition, the code is cleaner and shorter when +using the `pr_*` family of functions, +and you can define customized wrappers on a per-file basis +by redefining `pr_fmt()`, which we will explain below. + +``` + if(!ex) +- err(1, "unable to allocate memory"); +- printf("%s\n", example_get_message(ex)); +- if(!example_update_message(ex, "goodbye")) { +- int temperrno = errno; +- example_destroy(ex); +- errno = temperrno; +- err(1, "unable to update"); +- } +- printf("%s\n", example_get_message(ex)); ++ goto out; +``` + +With our handy `goto` statement, +we can dispose of all that mid-function +error handling code and consolidate the codepaths +of this function to flow through a single exit point. + +``` ++ ++ pr_info("%s\n", example_get_message(ex)); ++ ++ msg = KERN_ERR "unable to update\n"; ++ if(!example_update_message(ex, "goodbye")) ++ goto out_free; ++ ++ pr_info("%s\n", example_get_message(ex)); +``` + +Here, we make use of the `pr_info()` macro helper +to do exactly what `printk()` would have done, +but without having to include that strange looking +syntax prefixing the format string with a macro +separated by nothing but whitespace. +Actually, as we mention above, +the `pr_*` family provides one extra feature +that we do not use but we feel is worth a quick discussion. 
+ +The +[definition](https://elixir.bootlin.com/linux/v6.6/source/include/linux/printk.h#L528) +of `pr_info()` +passes the format string wrapped with yet another macro, +this being `pr_fmt`. +As the +[API documentation](https://www.kernel.org/doc/html/latest/core-api/printk-basics.html#c.pr_fmt) tells us, +we can define a custom format to be used each time +a `pr_*` macro is subsequently invoked in that translation unit. +The example given in the documentation is +yet another macro, `KBUILD_MODNAME`, +which is resolved at build time by +[Kbuild](https://docs.kernel.org/kbuild/kbuild.html), +the Linux kernel's bespoke build system, +and a flag set by +[scripts/Makefile.lib](https://elixir.bootlin.com/linux/v6.6/source/scripts/Makefile.lib#L126) +is passed to the compiler, defining this value appropriately in each context. +This is common, but one may use any string they like, +or leave out the definition entirely, as we do in this module. + +These two invocations of `pr_info()` +are the kernelspace replacements +for the two `printf()` calls back in our userspace code, +and here too, the success of these two calls results in +the strings "hello" and "goodbye" appearing in some external buffer. + +**Three clean exits** + +``` ++ ret = 0; +``` +The value of `ret` before this assignment is `-ENOMEM`, +so we must clear the error and set the return value +to 0, which indicates success. + +``` ++ msg = NULL; +``` + +As the `msg` variable contains an error message, +we set it to `NULL` to skip the invocation of `printk()` just below. + +``` ++out_free: + example_destroy(ex); +- return 0; ++out: ++ if(msg) ++ printk(msg); ++ return ret; ++} +``` +Finally, we conclude the definition of `example_init()` +by overlapping three exit cases together +using the `goto` statements defined earlier and the two labels +we define just above. +This is less complex than it may seem, +and as you may notice, we only use one level of indentation. + +First, the success case. 
+If all goes right, the CPU arrives at the code following +the `out_free` label, continues right along +after invocation of `example_destroy()`, +moves right past `out`, and jumps past the `printk()` +due to the `NULL` value of `msg` set just above. +We return with the value of `ret` set to `0`, +which is also taken care of just above, +and that's that for an error-free execution of `example_init()`. + +Second, our first call to `kmalloc()` to allocate memory +for a `struct example` may fail. +Then, our error-checking conditional leaves us +on the `goto out;` line just following, +and right away, the CPU is then executing just below the `out` label. +At this point, the value of `msg` is +the string "unable to allocate memory" +prefixed by "\001" "3", a.k.a `KERN_ERR`. +As this value is in fact not `NULL`, +we pass it to `printk()` +and we expect to see this string show up in our kernel ring buffer. +As always, we can check this with `dmesg`. +To conclude, we return the value of `ret`, +which is unmodified since its initialization and declaration +and therefore is `-ENOMEM`, which is correct. + +Third and finally, +we may succeed in allocating memory +for a `struct example`, +but then fail somewhere in `example_update_message()`, +which is indicated by a logically false, i.e. `0` return value +from the conditional wrapped invocation. +We can inspect this short function +and see that this failure can only happen in a single case, +and that case is also failed allocation, +but this detail is not important here. +What is important in the context of handling this error in the caller +is that we are responsible for `free`ing the memory +we allocated just before this to store our `struct example`. +If we were to simply return to the caller of `example_init()` right now, +not only would we lose the syntactically clean unified exit path, +we would generate a memory leak. 
+We also want to print the contents of the string data +at `msg`'s address to the kernel ring buffer, +and for whatever remains of brevity in this example, +we don't bother modifying the contained string. +Therefore, we jump over the second `pr_info()` invocation +and the assignment of appropriate success-case values +to `ret` and `msg`, +and immediately invoke `example_destroy()` on the address +we obtained from the initial and successful call to `kmalloc()`. +This closes the loop in terms of allocation +and prevents the module from leaking memory. +Do not forget that the severity of a memory leak in the kernel +is almost always far greater than in a user program, +especially a short-lived one. +As you may recall from the exposition above, +should we modify our example user program above +to leak memory, which can be implemented by the removal +of one or more calls to `free()`, we can easily debug +the issue with `valgrind`, +and regardless, +the kernel will clean up our mess +upon termination of the process and its threads. +In the kernel, every similar memory leak +will persist until the system is reset. +We emphasize this to illustrate the importance +of correctly managing the memory of kernel code +even in the more subtle codepaths such as this third case. +Once we free the memory at `ex`, +the non-`NULL` value of `msg` triggers the call to `printk()` +just as in the second case, +and finally, +we return `-ENOMEM`, +also just like the second case. + +``` ++void example_exit(void) ++{ + } +``` +We define this empty function because we need +to give a callable address with a particular type signature +to the kernel's module subsystem. +This is explained just below. + +**The final plumbing** + +``` ++module_init(example_init); ++module_exit(example_exit); +``` + +In order to properly explain +these two macro invocations, +we first need to take a step back +and talk about the bigger picture. 
+ +We are translating a C program designed +to compile into an executable binary file +that creates a single thread and interacts with Linux from userspace +into a C program designed to +compile into the Linux implementation of a +[loadable kernel module](https://en.wikipedia.org/wiki/Loadable_kernel_module) +that interacts with the kernel API +and expects to run on a CPU in privileged mode. +To build and run this code, +we first need to write an idiomatic makefile +and make sure the files necessary to build +modules specifically for the running or target kernel +are present in their expected locations. +When all of this is in place, +we can build a "kernel object" file, +whose filename is canonically but meaninglessly +suffixed with ".ko". +Using a utility like +[`insmod`](https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git/tree/tools/insmod.c) +we can pass this kernel object +to either the +[`init_module(2)` or `finit_module(2)`](https://man7.org/linux/man-pages/man2/init_module.2.html) +syscall, though in practice the `insmod` and `modprobe` utilities from +[kmod](https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git) exclusively invoke the +[latter](https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git/tree/shared/missing.h#n36) +due to an engineering preference for working with file descriptors. + +This syscall loads the module into kernel memory, +and if needed, relocates symbols and initializes module parameters. +After this, the kernel invokes the module's `init` function. +Now as we have made abundantly clear by now, +the `main` function that the C standard so generously specifies in +[section 5.1.2.2.1](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=31) +is not relevant to a Linux kernel module. +Instead of using a pre-defined name as our entry point, +we simply set the module's init function +to the address of a function of our choice +with the only constraint being the type signature, +which must be `int (*)(void)`. 
+Within the definition of the intuitively-named +[struct module](https://elixir.bootlin.com/linux/v6.6/source/include/linux/module.h#L402), +we find a member named +[`init`](https://elixir.bootlin.com/linux/v6.6/source/include/linux/module.h#L458), +with just the type signature we expect. +This `init` member holds the address of the init function +defined by a given module and +this `struct module` is the in-kernel representation +of a Linux kernel module. +Likewise, when we wish to unload a kernel module, +we use a tool such as +[`rmmod`](https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git/tree/tools/rmmod.c) +or the +[removal mode](https://git.kernel.org/pub/scm/utils/kernel/kmod/kmod.git/tree/tools/modprobe.c#n114) +of modprobe, +which passes the name of the module into the kernel by way of the +[`delete_module(2)`](https://man7.org/linux/man-pages/man2/delete_module.2.html) +syscall. +After checking whether the supplied name refers to +an extant loaded kernel module +with no outstanding references +held by other modules, +the kernel checks whether an `exit` function +is defined for the module. +If so, it is invoked before the module is unloaded. +The address of this `exit` function, +just like `init`, +is stored within a module's `struct module` +as a member helpfully named +[`exit`](https://elixir.bootlin.com/linux/v6.6/source/include/linux/module.h#L568). +While specification of a module `init` function is mandatory, +specification of an `exit` function is not. +If we don't ever need or want to unload the module, +then `exit` will never be called, +so we can exclude it entirely. +Since we do in fact wish to be able to unload our module +but we don't have anything to cleanup at unload time, +we simply define a dummy function and set `exit` to its address. + +We now return to the point +from which we took a step back, +namely, the usage of the `module_init` and `module_exit` +seen just above. 
+This is the method we use +to set the `init` and `exit` members +of the soon-to-be-generated `struct module` +that will be packaged into the kernel object file +by the kernel build system. +The two macros are +[defined](https://elixir.bootlin.com/linux/latest/source/include/linux/module.h#L130) +one right after the other. +The first part of the definition may initially bamboozle the reader, +however when we take away the semantically irrelevant +[`static` storage class specifier](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=116), +the +[`inline` function specifier](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf#page=141), +and +[`unused` function attribute](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-unused-function-attribute) +hiding right behind the +[`__maybe_unused` macro](https://elixir.bootlin.com/linux/v4.14/source/include/linux/compiler-gcc.h#L128), +we find the definition of a dummy function named `__inittest` +which takes no arguments and returns a value of type `initcall_t`. +The sole statement in the function body returns the address +of our candidate to be set as the module's `init` function. +The `unused` attribute hints at the true intent of this function. +The compiler will throw an error if the type signature of our chosen function +differs from that of +[initcall_t](https://elixir.bootlin.com/linux/v6.6/source/include/linux/init.h#L118). +This dummy function implements that compliance check at compile time, +saving all users the headache of debugging the runtime consequences +of a module author mistakenly using something exotic and noncompliant. +The final line is the business end of the macro definition. +We declare a function named `init_module` with the same type signature as `initcall_t`. 
+Then, we utilize the +[alias function attribute](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-alias-function-attribute) +to bind the address of our `init` function to the `init_module` symbol. +One of the artifacts generated and used by a kernel module build +is a file given the same filename as the primary C program source, +but with a `.mod.c` extension replacing the simple `.c`. +This file contains the following snippet, or something similar: + +```c +__visible struct module __this_module +__section(".gnu.linkonce.this_module") = { + .name = KBUILD_MODNAME, + .init = init_module, +#ifdef CONFIG_MODULE_UNLOAD + .exit = cleanup_module, +#endif + .arch = MODULE_ARCH_INIT, +}; +``` + +Thus the `init` function of our choosing is set as the module's `init` function +and made available to the kernel on a module load via its inclusion in this +generated `struct module`. +The `cleanup_module` function is similarly generated +by the `module_exit` macro. +This is how the kernel knows about the `example_init` and `example_exit` +functions in our kernel module example. + + +``` ++MODULE_LICENSE("GPL"); + +``` + +This line is required. +If we attempt to build our module without it, +we find that the modpost stage of the kernel build system +[complains and explodes](https://elixir.bootlin.com/linux/v6.6/source/scripts/mod/modpost.c#L1731). +This suicide stems from a failed check for a string +beginning with "`license=`" +in the module binary. +This string is emitted by the +[definition of `__MODULE_INFO`](https://elixir.bootlin.com/linux/v6.6/source/include/linux/moduleparam.h#L23) +which is the powerhouse of the +[`MODULE_LICENSE` macro](https://elixir.bootlin.com/linux/v6.6/source/include/linux/module.h#L230). +We use the string `"GPL"` to refer to the +[GNU General Public License](https://en.wikipedia.org/wiki/GNU_General_Public_License), +specifically +[version 2](https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html). 
+Our usage of this license designation in the module source code +assigns this free software license to our code. +This allows any individual or company to ensure +that their usage of our module or any other +complies with their legal and philosophical constraints. + +#### The Emerald City + +We have reached the end +of the yellow brick road +of our diff. +Although everything you need +to generate the final kernel driver is above, +we include the final result right here +for emphasis and ease. + +```c +$ cat kernel_code.c +#include <linux/module.h> +#include <linux/slab.h> +#include <linux/string.h> + +struct example +{ + char *message; + size_t size; +}; + +static struct example *example_create(const char *msg) +{ + struct example *ex = kmalloc(sizeof *ex, GFP_KERNEL); + if(!ex) + goto out; + ex->size = strlen(msg); + ex->message = kstrdup(msg, GFP_KERNEL); + if(!ex->message) + goto out_free; + return ex; +out_free: + kfree(ex); + ex = NULL; +out: + return ex; +} + +static void example_destroy(struct example *ex) +{ + kfree(ex->message); + kfree(ex); +} + +static bool example_update_message(struct example *ex, const char *msg) +{ + size_t size = strlen(msg); + char *data = kstrdup(msg, GFP_KERNEL); + if(!data) + return false; + kfree(ex->message); + ex->message = data; + ex->size = size; + return true; +} + +static char *example_get_message(struct example *ex) +{ + return ex->message; +} + +int example_init(void) +{ + int ret = -ENOMEM; + const char *msg; + struct example *ex = example_create("hello"); + msg = KERN_ERR "unable to allocate memory"; + if(!ex) + goto out; + + pr_info("%s\n", example_get_message(ex)); + + msg = KERN_ERR "unable to update\n"; + if(!example_update_message(ex, "goodbye")) + goto out_free; + + pr_info("%s\n", example_get_message(ex)); + + ret = 0; + msg = NULL; +out_free: + example_destroy(ex); +out: + if(msg) + printk(msg); + return ret; +} + +void example_exit(void) +{ +} + +module_init(example_init); +module_exit(example_exit); + +MODULE_LICENSE("GPL"); +``` + +For further 
convenience, +we include an idiomatic makefile +which will build the above kernel module +on a properly configured system. + +```Makefile +$ cat Makefile +obj-m += kernel_code.o + +.PHONY: build clean load unload + +build: + make -C /lib/modules/$(shell uname -r)/build modules M=$(shell pwd) +clean: + make -C /lib/modules/$(shell uname -r)/build clean M=$(shell pwd) +load: + sudo insmod kernel_code.ko +unload: + -sudo rmmod kernel_code +``` diff --git a/_articles/proc_filesystem.md b/_articles/proc_filesystem.md new file mode 100644 index 0000000..b478bfe --- /dev/null +++ b/_articles/proc_filesystem.md @@ -0,0 +1,103 @@ +--- +layout: article +title: The Process Filesystem +--- +Unlike some of the more esoteric resources that +can be referred to by a file descriptor, +the entries found in the `/proc` directory on +any Linux system are in fact real files. + +However, they are not entirely like other files: +they are transient. +That is to say, these files are not stored +on any long-term storage +media, e.g. a hard drive. +These files don't need long term storage because +they provide access to information that only +exists at runtime. + +Instead of reading the directory +structure and contents from a storage medium, +the kernel creates the files in `/proc` at runtime +and synthesizes their contents on demand. + +Specifically, the kernel creates a directory for each +running process on the system named after its pid. +In addition, the kernel provides a "magic" symlink +named `self` +whose target depends on which process is looking. +Any process that examines the symlink +sees it resolve to the folder that corresponds to +the calling process's pid. + +This directory contains information about running processes. +For a complete list of the contents, refer to the kernel +[documentation](https://docs.kernel.org/filesystems/proc.html) and the +[manpage](https://man7.org/linux/man-pages/man5/proc.5.html). 
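The `self` symlink can also be observed from a C program. The following is a minimal sketch, not part of the original demo: the helper names `read_self_comm` and `read_self_exe` are made up for illustration, and we assume a Linux system with `/proc` mounted. The first helper reads the calling process's command name from `/proc/self/comm`, and the second resolves the `/proc/self/exe` symlink with `readlink(2)`.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the calling process's command name from
 * /proc/self/comm into buf, without the trailing newline.
 * Returns 0 on success, -1 on failure.
 */
int read_self_comm(char *buf, size_t len)
{
	FILE *f;

	if (len == 0)
		return -1;
	f = fopen("/proc/self/comm", "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Hypothetical helper: resolve the /proc/self/exe symlink into buf.
 * readlink(2) does not NUL-terminate its result, so we do it ourselves.
 * Returns 0 on success, -1 on failure.
 */
int read_self_exe(char *buf, size_t len)
{
	ssize_t n;

	if (len == 0)
		return -1;
	n = readlink("/proc/self/exe", buf, len - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return 0;
}
```

Because both paths go through `self`, the same compiled code reports a different answer in every process that runs it, without ever needing to look up its own pid.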
+ + +Unfortunately, +`/proc` also contains many +miscellaneous files that were added +before the community developed `/sys`. +They are still present to preserve +backwards compatibility. + +### A `/proc`tical example + +In bash, `$$` is a +[special variable](https://www.gnu.org/software/bash/manual/html_node/Special-Parameters.html) +that expands to the pid of the bash process. + +For example: + + $ echo $$ + 1337 + +This means we can use `$$` when building a path +to reference the `/proc` subdirectory corresponding +to the running bash process. +In P1, the systemcall used the +`get_task_comm` kernel macro to find the name +of the running program. +`/proc` also provides userspace access to this +information. Here is an example: + + $ cat /proc/$$/comm + bash + +We can also discover the absolute path of the +executable invoked to start the process by +traversing another "magic" symlink named `exe`: + + $ readlink /proc/$$/exe + /usr/bin/bash + +If we replace `$$` with `self`, +we are now referring to the child process +the shell created by `fork`ing itself +and `exec`ing the user command: + + $ cat /proc/self/comm + cat + + $ readlink /proc/self/exe + /usr/bin/readlink + +Another useful entry in `/proc` for +a given process is the `fd` directory, +which contains magic symlinks to all file +descriptors owned by the process: + + $ ls -l /proc/self/fd + ... 0 -> /dev/pts/0 + ... 1 -> /dev/pts/0 + ... 2 -> /dev/pts/0 + ... 3 -> /proc/128523/fd + +As expected, the first three entries are +`stdin`, `stdout`, and `stderr` +which are connected to our terminal. +We can also see how the `ls` program opens +its own subdirectory in `/proc` by following +the "magic" `/proc/self` symlink. diff --git a/_articles/syscall_basics.md b/_articles/syscall_basics.md new file mode 100644 index 0000000..8560fbf --- /dev/null +++ b/_articles/syscall_basics.md @@ -0,0 +1,48 @@ +--- +layout: article +title: Syscall Basics +--- +### C system programing interactive demo + +0. 
the most basic C program possible: `main(){}` + +1. hello world + +2. What is a system call? + + * Synchronously ask Linux to do something for you as a user program + + * Student examples: open, read, write, close + + * Use these with file descriptors to do file I/O without `FILE *` + + * Use `man 2 xxx` to learn about system call `xxx` + +3. Working with processes + + * [execve](https://man7.org/linux/man-pages/man2/execve.2.html): replace the program running in a process with another + + * [fork](https://man7.org/linux/man-pages/man2/fork.2.html): split one process in two + + * [wait](https://man7.org/linux/man-pages/man2/wait.2.html): wait for a child process + +### Continue Syscall Demo + +* Charlie demonstrated several system calls needed to complete `P0`, with a focus on dealing with file descriptors + +* Take note that the `struct stat` for a file contains an `st_mode` field that specifies the file type and permissions + +* Use the [`dup(2)`](https://man7.org/linux/man-pages/man2/dup.2.html) family to redirect file descriptors, for example onto the ends of an unnamed pipe created with `pipe(2)` + +* When working with C strings in the kernel, we recommend passing around a pointer and a length pair rather than relying on null-termination + + * One must take extra precautions to avoid buffer overflows in the kernel + * While in userspace a buffer overflow is harmful, in the kernel it can be catastrophic to a system, and in a production environment, devastating to an organization + + * We will spend more time discussing security concerns later on + +* Note that file descriptors are integers that index into a per-process table whose entries refer to open file descriptions. + +* [`dup(2)`](https://man7.org/linux/man-pages/man2/dup.2.html) adds an entry to the file descriptor table so that a second file descriptor refers to the same open file description as the first. + +* While `dup()` takes an fd to duplicate and returns the lowest unused fd, `dup2()` takes a second fd number to use for the copy, first closing whatever that fd previously referred to.
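To make the interplay of `pipe(2)`, `fork(2)`, and `dup2(2)` concrete, here is a minimal sketch. The helper name `capture_child_output` is hypothetical and error handling is kept short. The parent creates an unnamed pipe with `pipe(2)`, the child uses `dup2()` to make fd 1 refer to the write end, and the parent reads back whatever the child wrote to its standard output.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child whose stdout is redirected into a pipe,
 * have it write `text`, and collect the bytes into buf (NUL-terminated).
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t capture_child_output(const char *text, char *buf, size_t len)
{
	int fds[2]; /* fds[0] is the read end, fds[1] is the write end */
	ssize_t total = 0, n;
	pid_t pid;

	if (len == 0 || pipe(fds) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: fd 1 now refers to the same open file description as fds[1] */
		close(fds[0]);
		if (dup2(fds[1], STDOUT_FILENO) < 0)
			_exit(1);
		if (fds[1] != STDOUT_FILENO)
			close(fds[1]); /* fd 1 keeps the write end open */
		if (write(STDOUT_FILENO, text, strlen(text)) < 0)
			_exit(1);
		_exit(0);
	}

	/* Parent: close our copy of the write end so read() can report EOF */
	close(fds[1]);
	while ((size_t)total < len - 1 &&
	       (n = read(fds[0], buf + total, len - 1 - (size_t)total)) > 0)
		total += n;
	buf[total] = '\0';
	close(fds[0]);
	waitpid(pid, NULL, 0);
	return total;
}
```

Note that the parent must close its own copy of the write end: the read end only reports end-of-file once every descriptor referring to the write end has been closed.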
diff --git a/_articles/syscalls_end_to_end.md b/_articles/syscalls_end_to_end.md new file mode 100644 index 0000000..c29d995 --- /dev/null +++ b/_articles/syscalls_end_to_end.md @@ -0,0 +1,131 @@ +--- +layout: article +title: "Syscalls: End-to-End" +--- +### Part I + +All previously discussed exceptions were just that: exceptional. + +The kernel either handled them behind the scenes, invisibly to the user, or they were the result +of a bug, in which case the kernel sent the program a fatal signal (e.g. `SIGILL`, `SIGSEGV`, and `SIGFPE`). + +Today we demonstrate the mechanism by which your code can intentionally call a system function running in kernel mode; in other words, a system call. + +We want to know how the userspace invocation ultimately connects back to the `SYSCALL_DEFINE` macro that you hunted down, traced, and wrote a history report on during +[E1](https://kdlp.underground.software/course/fall2023/assignments/E1.md). + +Let's follow the path of a syscall on x86-64. + +First of all, a user program invokes the +[syscall](https://www.felixcloutier.com/x86/syscall) +instruction. + +As described therein, the CPU immediately elevates +privilege level and jumps to the address stored in the `LSTAR` model-specific register (MSR). + +Who set the value of `LSTAR`? + +In the +[syscall_init](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/kernel/cpu/common.c#L2054) +function, we find this +[line](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/kernel/cpu/common.c#L2057): + + wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64); + +This sets the value of `LSTAR` to the address of the +[entry_SYSCALL_64](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/entry_64.S#L87) +function. + +This is not C code :) + +The `.S` suffix indicates this is not just regular assembly code, but code that must be run through `cpp` in order to resolve macros in the file.
The kernel uses these macros to make it easier to write the code correctly and facilitate interoperability with the rest of the kernel. + +At this point, the CPU is executing kernel code in privileged mode, but all of the registers still hold userspace data. + + +### Part II + +At the end of Part I, we had followed the syscall process up to the point where the processor jumps to the +[entry_SYSCALL_64](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/entry_64.S#L87) +function within `entry_64.S`, which is written in assembly. As documented [here](https://www.felixcloutier.com/x86/syscall), +for performance reasons the main actions taken by the `syscall` instruction besides elevating privilege to ring 0 are: + +* Saving the return address before jumping to the kernel handler: + * The address of the instruction following `syscall` (the current `rip`) is copied into `rcx` + * The address of the kernel handler is loaded from the `LSTAR` model-specific register into `rip` +* Saving the current processor flags before resetting them to a known value + * The current flags `RFLAGS` are copied into `r11` + * Every flag whose bit is set in the `FMASK` model-specific register is cleared + +Within the `entry_SYSCALL_64` handler function, it is the responsibility of the kernel to save any +other userspace state that it wishes to restore when the syscall returns. +In the case of the Linux kernel, all normal CPU registers should be saved. + +However, this presents a problem because essentially all CPU instructions involve manipulating the +values stored in the CPU registers. We want to save the data somewhere in memory, but we can't even +load a fixed pointer into a register to move data into that memory location because that would clobber +one of the values we need to save. + +Fortunately, the designers of the CPU built in an escape hatch for this exact problem: the `swapgs` instruction. + +On x86 `gs` is a special type of register called a "segment" register.
Segment registers were historically added to +facilitate easier access to more than 64K of memory on Intel's 16-bit 8086 CPU (general purpose registers could store +16-bit pointers and segmentation could fill in the correct higher order bits to determine the full address +depending on what your instruction was doing with the pointer (using it to access code, or data, or the stack, etc.)). + +Segmentation is no longer a concern on 64-bit systems where the registers can easily store pointers to orders of magnitude +more virtual addresses than any computer could have physical RAM, but the segment registers still exist on modern CPUs +and they have picked up a new function: storing a pointer for accessing thread-local data. The `gs` register holds a pointer +to a block of memory reserved for thread-specific data and any instruction can use this pointer by setting the `gs` prefix +on a memory access and providing the desired offset into the thread-specific data as the "address". The CPU will add the base +address of the thread-specific data from `gs` to the offset supplied in the instruction and the thread-local data will be accessed. + +The special `swapgs` instruction ([line 91](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/entry_64.S#L91)) allows +ring 0 (kernel) code to atomically exchange the value of `gs` with a well-known value that the kernel previously stored in a model-specific +register. After the exchange, `gs` holds a pointer to per-CPU data and the MSR holds the old `gs` value from userspace, +so it can be restored later by a second `swapgs`. + +The handler code can then use scratch space allocated in the per-CPU block to save the userspace stack pointer +([line 93](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/entry_64.S#L93)) and replace `rsp` with a pointer to +the kernel stack ([line 95](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/entry_64.S#L95)).
Once that has been completed, + the rest of the registers can be saved by just pushing them onto the kernel stack. The values are pushed in a specific order +([lines 100-109](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/entry_64.S#L100)) to make the overall footprint of the +data on the stack match the layout of a [`struct pt_regs`](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/include/asm/ptrace.h#L59). + +This means that after all the pushing, `rsp` is a valid `struct pt_regs *` pointer. It is copied into `rdi` +([line 112](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/entry_64.S#L112)) to become the first argument, and +the syscall number in `rax` is copied into `rsi` ([line 114](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/entry_64.S#L114)) +to become the second argument, of the call to the C function [`do_syscall_64`](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/common.c#L73). + +From this point things are simpler: the kernel attempts to interpret the syscall number as a 64-bit syscall by calling +[`do_syscall_x64`](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/common.c#L40) and we can assume this is successful +if we are calling from 64-bit code. + +The meat of that function is verifying that the syscall number is in range ([line 48](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/common.c#L48)), +looking up the function pointer for the corresponding syscall number in an array, calling it, and saving the return value into the entry for the `ax` register in +the `struct pt_regs` that will be restored when the kernel code returns ([line 50](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/common.c#L50)). + +The [`sys_call_table`](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/syscall_64.c#L16) array is generated using a technique called an +[X macro](https://en.wikipedia.org/wiki/X_macro).
During the kernel build process, a header file is generated using +[the syscall table](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/syscalls/syscall_64.tbl) +that invokes a macro `__SYSCALL` (that is not defined within the header) once for each syscall with arguments of its number and entry point. + +The `__SYSCALL` macro can be given whatever definition the user wants and then the header can be included to programmatically generate invocations +of that specific version of the macro for each syscall. Within [`arch/x86/entry/syscall_64.c`](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/syscall_64.c) +the `__SYSCALL` macro is defined twice. The first time ([lines 10-12](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/syscall_64.c#L10)) +it uses the syscall name in the argument to form a declaration for a function named `__x64_sys_something` that takes a `const struct pt_regs *` argument, +and the second time ([lines 14-18](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/entry/syscall_64.c#L14)) +it fills the `sys_call_table` variable with pointers to each of those functions in the right order. + +These wrapper functions are defined as part of the [`SYSCALL_DEFINE` macro](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/include/asm/syscall_wrapper.h#L228). +The [`__X64_SYS_STUBx`](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/include/asm/syscall_wrapper.h#L96) macro generates +a function named `__x64_sys_whatever` that takes a `const struct pt_regs *` and whose body just calls another wrapper, whose name starts with `__se_`, passing it the real syscall args. +These are obtained by the [`SC_X86_64_REGS_TO_ARGS`](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/include/asm/syscall_wrapper.h#L56) +macro, which expands the list of arguments into accesses of `regs->register` for each register in the appropriate order.
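The table-plus-wrapper arrangement described above can be imitated in a few lines of ordinary userspace C. The following is only a sketch under stated assumptions, not the kernel's code: `struct demo_regs`, `DEMO_SYSCALLS`, `demo_call_table`, `demo_do_syscall`, and the two toy "syscalls" are all hypothetical names, and an ordinary list macro stands in for the header that the kernel generates from `syscall_64.tbl`. It shows one list being expanded twice (once for declarations, once for the table), wrappers that unpack their arguments from saved registers, and a dispatcher that range-checks the number before calling through the table.

```c
/* Hypothetical stand-in for struct pt_regs with only the fields used here */
struct demo_regs {
	long ax; /* receives the return value */
	long di; /* first argument */
	long si; /* second argument */
};

/*
 * The single list of "syscalls". In the kernel this role is played by a
 * generated header; here it is an ordinary macro taking another macro X.
 */
#define DEMO_SYSCALLS(X) \
	X(0, add)        \
	X(1, negate)

/* First expansion: declare one wrapper per entry */
#define X(nr, name) static long demo_sys_##name(const struct demo_regs *regs);
DEMO_SYSCALLS(X)
#undef X

/* Second expansion: fill the table with pointers to those wrappers */
typedef long (*demo_sys_fn)(const struct demo_regs *);
#define X(nr, name) [nr] = demo_sys_##name,
static const demo_sys_fn demo_call_table[] = {
	DEMO_SYSCALLS(X)
};
#undef X

#define DEMO_NR_SYSCALLS (sizeof(demo_call_table) / sizeof(demo_call_table[0]))

/* The "real" implementations take ordinary C arguments */
static long do_add(long a, long b) { return a + b; }
static long do_negate(long a) { return -a; }

/* Each wrapper unpacks its arguments from the saved registers */
static long demo_sys_add(const struct demo_regs *regs)
{
	return do_add(regs->di, regs->si);
}

static long demo_sys_negate(const struct demo_regs *regs)
{
	return do_negate(regs->di);
}

/*
 * Range-check the number, call through the table, and store the result
 * in the slot for ax. Returns 1 if a syscall was dispatched, 0 otherwise.
 */
int demo_do_syscall(struct demo_regs *regs, unsigned long nr)
{
	if (nr >= DEMO_NR_SYSCALLS)
		return 0;
	regs->ax = demo_call_table[nr](regs);
	return 1;
}
```

Adding a third entry to `DEMO_SYSCALLS` updates both the declarations and the table at once, which is the main appeal of the technique: the list of syscalls lives in exactly one place.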
+ +The `__se_` wrapper (short for sign extension) has to do with 32 bit compatibility and is generated in place by the SYSCALL_DEFINE macro +([line 233](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/include/asm/syscall_wrapper.h#L233)), and finally it calls into +a function with the prefix `__do_` that is the function whose header is right at the end of the wrapper ([line 240](https://elixir.bootlin.com/linux/v6.5/source/arch/x86/include/asm/syscall_wrapper.h#L240)) +and whose body is supplied by the code within the curly braces that follows a given `SYSCALL_DEFINE` invocation. + +At this point it is running the code in that block and the syscall has officially begun :) diff --git a/_articles/template.md b/_articles/template.md deleted file mode 100644 index ccf5907..0000000 --- a/_articles/template.md +++ /dev/null @@ -1,5 +0,0 @@ ---- -layout: article -title: Article Title ---- -Lorem ipsum diff --git a/_articles/topics.md b/_articles/topics.md new file mode 100644 index 0000000..4ead607 --- /dev/null +++ b/_articles/topics.md @@ -0,0 +1,79 @@ +--- +layout: article +title: "Intro to Kernel Development: The Topics" +--- +Working in Linux as an engineer +-- + +- setting up development env +- using git +- using ./configure && make (autotools) +- commandline basics & tips, tricks +- awk & more cool stuff + + +What is a Unix-like Operating System? +-- + +- Linux vs Unix & history: where does Linux come from? +- building a distro, minimal distro ("aparatus") +- what is a Linux process, syscall basics, file descriptor intro +- e.g. 
grep in a child process (bill cs308 assignment 2) +- open read write fork exec dup pipe wait close exit +- character devices and their operations + +Hardware meets software +-- + +- What is CPU privilege + - trying to execute forbidden instructions (rdmsr_priv_demo) + - kernel abilities vs userspace +- exceptions: traps, interrupts, aborts +- featured example: Syscalls end-to-end + - tracing a syscall instruction from the invocation + - ??? stuff happens ??? + - response back to userspace + +Assembly and C for kernel development +-- + +- writing makefiles +- C compilation deep dive, object files +- assembly vs machine code +- writing assembly to interact with the kernel without libc + +Debugging +-- + +- searching code: cscope, elixir, git grep +- raw ftrace vs library ftrace +- ptrace +- strace +- perf +- gdb with qemu-system +- bpftrace +- valgrind + +Pseudo-filesystems as kernel interfaces: a Linux specialty +-- + +- /proc +- /sys +- /dev +- cgroup & ftrace + +Kernel Modules as a way to learn about writing kernel code +-- + +- hello world module +- data types +- concurrency & kernel locking API + +Participating in open source communities +-- + +- public speaking +- technical writing +- sending patches in general +- upstream Linux mailing list tour +- code review diff --git a/_articles/tracing.md b/_articles/tracing.md new file mode 100644 index 0000000..b35fa2e --- /dev/null +++ b/_articles/tracing.md @@ -0,0 +1,390 @@ +--- +layout: article +title: Kernel Tracing +--- +### Tracing with ftrace + +Ftrace is exposed through the tracefs file system, usually mounted at `/sys/kernel/debug/tracing`. + +The kernel source has some +[documentation](https://www.kernel.org/doc/Documentation/trace/ftrace.txt). + +The demo in class was done approximately like so: + +0. Ensure tracing is off and wipe the buffer: + + echo -n > trace + echo 0 > tracing_on + +0. Select the function_graph tracer instead of the nop tracer + + echo function_graph > current_tracer + +0. 
Turn tracing on and back off again + + echo 1 > tracing_on + echo 0 > tracing_on + +0. Take a look at the output + + {one of: less,nano,vim} trace + +The events recorded in this file all took place between enabling and disabling tracing. There should be a large number of lines in this file. + +### Finding syscall definitions in kernel source + +This file defines the `SYSCALL_DEFINE*` macros: + +[/include/linux/syscalls.h](https://elixir.bootlin.com/linux/latest/source/include/linux/syscalls.h) + +This will quickly locate the syscall named `$SYSCALLNAME` in the kernel (note the double quotes, which let the shell expand the variable): + + git grep "^SYSCALL_DEFINE.($SYSCALLNAME" + +e.g. `git grep '^SYSCALL_DEFINE.(read'` + +### Tracing with bpftrace + +bpftrace is a scripting language that implements a frontend for the BPF Linux subsystem. + +[BPF](https://en.wikipedia.org/wiki/Berkeley_Packet_Filter) (formerly eBPF) is a Linux subsystem that implements an in-kernel virtual machine for running verified BPF bytecode programs, which can be loaded while the kernel is running. + +The bpftrace syntax is based on +[awk](https://en.wikipedia.org/wiki/AWK), +a small scripting language for rapidly processing text files using very simple programs. + +Here's a complete hello world executable script in awk: + + #!/bin/awk -f + + BEGIN { printf("hello, World!\n"); } + +You can also run this directly in the shell: + + [root@kdlp ~]# awk 'BEGIN { printf("hello, World!\n"); }' + +This is also valid bpftrace (you must `dnf install bpftrace`): + + [root@kdlp ~]# bpftrace -e 'BEGIN { printf("hello, World!\n"); }' + Attaching 1 probe... + hello, World! 
+ +Use this if you run into problems with bpftrace: +[Great Reference Guide](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md) + +Buy this to take your BPF knowledge to the next level: +[The BPF Bible: how to become one with the kernel](https://www.amazon.com/Performance-Tools-Addison-Wesley-Professional-Computing/dp/0136554822) + + +**Cscope Interlude** + +With the vim cscope plugin and cscope installed on your system: + +Run: `cscope -R -k` in the root of your kernel tree to create your cscope database. + + +Within vim you can use `cs find g struct cs find t 'struct whatever {'` to find the definition of that structure. You can also do this without vim by directly running `cscope` in your kernel tree. + + +**A Quick Example** + +List all available kprobes that trace functions containing the characters xxx (the quotes keep the shell from expanding the glob): + + bpftrace -l 'k:*xxx*' + +On every invocation of the `xxx` kernel function, print the command name string and the pid for the current task on the CPU (see: +[task_struct](https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L738)). + + bpftrace -e 'k:xxx { printf("%s/%d\n", comm, pid) ; }' + +**A Final Example** + +First, we'll need a little program to trace: + +``` +[root@kdlp ~]# cat > foo <<EOF +#!/bin/bash + +echo x > x +cat x +EOF +[root@kdlp bpftest]# chmod +x foo + +``` + +Trigger on the read system call and print its arguments (sans user buffer), the kernel stack, and the userspace stack. + + bpftrace -e 'kprobe:__x64_sys_read /comm=="foo"/ { printf("hello read(%d, %s, %zu) %s/%d [%s] [%s]\n",arg0, str(arg1), arg2, comm, pid, kstack, ustack); }' + +This is unwieldy at the interactive shell. Let's put it in an executable file with some formatting: + +``` +[root@kdlp bpftest]# cat > kprobe_read.bp < tracepoint_read.bp <fd, str(args->buf), args->count, comm, pid, kstack, ustack); +} +EOF + +[root@kdlp bpftest]# chmod +x tracepoint_read.bp +[root@kdlp bpftest]# ./tracepoint_read.bp +Attaching 1 probe...
+``` + +Finally, we invoke `./foo` in a separate shell on the same system to generate something like the following: + +``` +hello read(3, , 832) foo/38364 [] [ + __GI___read_nocancel+8 + 0x7f89f99a4af7 + 0x7f89f999e78d + 0x7f89f999d523 + 0x7f89f999ebe8 + 0x7f89f99ba10e + 0x7f89f99b6c16 + 0x7f89f99b83de + 0x7f89f99b7208 +] +hello read(3, ����, 832) foo/38364 [] [ + 0x7f89f99bd478 + 0x7f89f99a4af7 + 0x7f89f999e78d + 0x7f89f999d523 + 0x7f89f999ebe8 + 0x7f89f99ba10e + 0x7f89f99b6c16 + 0x7f89f99b83de + 0x7f89f99b7208 +] +hello read(3, , 4096) foo/38364 [] [ + 0x7f89f98891e8 + 0x7f89f9808fb9 + 0x7f89f97fb4ca + 0x7f89f98058ae + 0x7f89f97bccf8 + 0x7f89f97bd26a + 0x7f89f97b69c9 + 0x7f89f97b5f7d + 0x556ebbbad4b5 + 0x556ebbb3a703 + 0x7f89f97aab8a + 0x7f89f97aac4b + 0x556ebbb3c255 +] +hello read(3, # Locale name alias data base. +# Copyright (C) 1996-2023 Free S, 4096) foo/38364 [] [ + 0x7f89f98891e8 + 0x7f89f9808fb9 + 0x7f89f97fb4ca + 0x7f89f98058ae + 0x7f89f97bccf8 + 0x7f89f97bd26a + 0x7f89f97b69c9 + 0x7f89f97b5f7d + 0x556ebbbad4b5 + 0x556ebbb3a703 + 0x7f89f97aab8a + 0x7f89f97aac4b + 0x556ebbb3c255 +] +hello read(3, (�!, 80) foo/38364 [] [ + 0x7f89f98840c1 + 0x7f89f97aab8a + 0x7f89f97aac4b + 0x556ebbb3c255 +] +hello read(255, , 30) foo/38364 [] [ + 0x7f89f98840c1 + 0x556ebbb96628 + 0x556ebbb3f018 + 0x556ebbc185bb + 0x556ebbb43d01 + 0x556ebbb47940 + 0x556ebbb47b0c + 0x556ebbb47d2e + 0x556ebbb3b522 + 0x7f89f97aab8a + 0x7f89f97aac4b + 0x556ebbb3c255 +] +hello read(255, #!/bin/bash + +echo x > x +cat x +, 30) foo/38364 [] [ + 0x7f89f98840c1 + 0x556ebbb96628 + 0x556ebbb3f018 + 0x556ebbc185bb + 0x556ebbb43d01 + 0x556ebbb47940 + 0x556ebbb47b0c + 0x556ebbb47d2e + 0x556ebbb3b522 + 0x7f89f97aab8a + 0x7f89f97aac4b + 0x556ebbb3c255 +] + +``` + +An important advantage of the tracepoint probe type is interface stability as systemcalls do not change, but unfortunately, the kernel stack is not available to the tracepoint-type probe. 
There is no one-size-fits-all solution, as each probe type has its pros and cons. + + +For a much more detailed demo by the bpftrace grandmaster himself (Brendan Gregg), check out this +[Kernel analysis with bpftrace](https://lwn.net/Articles/793749/) +article from +[Linux Weekly News](https://lwn.net/). +Although some of the information may be out of date, as the article dates to mid-2019, it's definitely worth a read for anyone who wants to learn more.