a llama 3 inference server from scratch in c++ and cuda
currently under active development, see status below
my december 2024 project, i've retired from advent of code :)
a learning exercise, inspired by:
- https://karpathy.ai/zero-to-hero.html
- https://github.com/karpathy/llama2.c
- https://jaykmody.com/blog/gpt-from-scratch/
- https://github.com/ggerganov/llama.cpp
status:
- multithreaded web server
- test framework
- json parser
- openai-compatible chat completion api
- parse safetensors, params, tokenizer configs
- tokenizer
- llama3.2 in cuda
- profiling and optimization
- docker image
other improvements:
- backpressure w/http 529
- streaming w/server-sent events (see the sketch after this list)
- add /statusz with metrics etc.
- revisit concurrency
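for context on the streaming item above: server-sent events deliver the response as a sequence of data: lines, one per completion delta, in the style of the openai streaming format. a rough sketch of what such a stream could look like (illustrative only, not what the server emits today):

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]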
requirements:
- x86_64
- ubuntu 24.04
- gcc 13.2
- cmake 3.22
- nvidia gpu with at least 3gb vram
- nvidia cuda toolkit: https://developer.nvidia.com/cuda-toolkit
- Llama3.2-1B-Instruct SafeTensors from Hugging Face:
- Create a Hugging Face account at https://huggingface.co/
- Agree to the license agreement at https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
- Wait for approval; check status at https://huggingface.co/settings/gated-repos
- Install the Hugging Face CLI:
pip install -U "huggingface_hub[cli]"
- Log in via the CLI:
huggingface-cli login
You'll have to paste a token from https://huggingface.co/settings/tokens (create one with at least read access if you don't have one yet).
- Download the repository:
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct
- We assume the files are downloaded and stored in the default Hugging Face cache, i.e. at
$HOME/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/$SHA/
The server will look for this directory (using the first SHA it encounters) at startup; if you want to override this path, use the --model-dir flag.
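for example, once the server is built (see the commands below), you could point it at a specific snapshot directory explicitly; the exact flag syntax here is a sketch, and the path shown is just the default cache layout spelled out:

./build/gabby --model-dir $HOME/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct/snapshots/$SHA/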
cmake -S . -B build # prepare build
cmake --build build # build everything
./build/gabby_test # run tests
./build/gabby --debug --port 8080 --workers 7
this will start the server running on localhost at port 8080 with seven worker threads and DEBUG level logging. note that it supports graceful shutdown via SIGINT and SIGTERM.
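for example, besides ctrl-c in the foreground terminal, you can stop a running server cleanly from another shell (assuming the binary keeps the name gabby):

kill -TERM "$(pidof gabby)"   # SIGTERM triggers the same graceful shutdown as SIGINT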
while it's running, you can call the chat completion api:
curl localhost:8080/v1/chat/completions -d '{
"model": "gabby-1",
"messages": [{
"role": "system",
"content": "You are a helpful assistant."
},{
"role": "user",
"content": "Hello!"
}]
}'
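since the api is openai-compatible, the reply follows the chat completion response schema; an illustrative response (field values here are made up) looks roughly like:

{
  "id": "chatcmpl-0",
  "object": "chat.completion",
  "model": "gabby-1",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }]
}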
or use an openai-compatible chat app (like boltai for mac).
license: MIT