# Runpod Serverless GGUF

Builds a Docker image for RunPod serverless workers based on GGUF models from Hugging Face.

Currently, only models stored as a single `.gguf` file are supported (Hugging Face limits individual files to 50 GB, so larger models are split across several files).

## Building

The easiest way to build is the Python script, which prompts for the parameters interactively:

```console
$ python build.py
Enter MODEL (default: TheBloke/CodeLlama-34B-Instruct-GGUF):
Enter QUANTITIZATION (default: Q5_K_S):
Enter IS_70B (yes/no, default: no):
Enter HF_TOKEN:
Enter REPOSITORY_NAME (default: myuser/my-test-image):
Push Image to repository (yes/no, default: no):
Run docker as sudo (yes/no, default: no):
Building Image: myuser/my-test-image:thebloke-codellama-34b-instruct-gguf-Q5_K_S-CUBLAS
```

Alternatively, build it yourself:

```bash
export DOCKER_BUILDKIT=1 # Important: activates BuildKit
export MODEL=TheBloke/CodeLlama-34B-Instruct-GGUF # Hugging Face repository; make sure it contains .gguf files
export QUANTITIZATION=Q5_K_S # Must match the quantization suffix in the filename; this downloads codellama-34b-instruct.Q5_K_S.gguf
export HF_TOKEN=your_hugging_face_token_here

docker build -t runpod-images:dev . \
  --platform linux/amd64 \
  --build-arg HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  --build-arg MODEL_NAME=$MODEL \
  --build-arg QUANTITIZATION=$QUANTITIZATION
```

If the model is a 70B-parameter model, the following build arg must be added:

```bash
--build-arg IS_70B=True
```
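For example, a complete build for a 70B model might look like this: a minimal sketch, assuming a hypothetical repository and quantization (substitute your own values):

```bash
# Example values only: use a real 70B GGUF repository and a matching quantization suffix.
export MODEL=TheBloke/Llama-2-70B-Chat-GGUF
export QUANTITIZATION=Q4_K_M
export HF_TOKEN=your_hugging_face_token_here

docker build -t runpod-images:dev . \
  --platform linux/amd64 \
  --build-arg HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  --build-arg MODEL_NAME=$MODEL \
  --build-arg QUANTITIZATION=$QUANTITIZATION \
  --build-arg IS_70B=True
```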

## Testing Image

Create a file `test_input.json` with input values for the model:

```json
{
  "input": {
    "prompt": "Q: What is Python?\nA:",
    "stop": ["Q:", "\n"],
    "max_tokens": 50
  }
}
```

To run the image with the input:

```bash
docker run -v $(pwd)/test_input.json:/test_input.json runpod-images:dev
```
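Once the image behaves as expected, it has to be pushed to a registry RunPod can pull from (the Python script offers this step too). A sketch, assuming the hypothetical repository name from the `build.py` example:

```bash
# "myuser/my-test-image" is a placeholder; use your own registry/repository.
docker tag runpod-images:dev myuser/my-test-image:dev
docker push myuser/my-test-image:dev
```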

## Usage

The following environment variables are available (see the example below):

- `N_CTX`: Context size.
- `NUM_GPU_SHARD`: Number of GPU shards.
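For a local run, these can be passed with Docker's `-e` flag. A minimal sketch; the values here are illustrative, not defaults:

```bash
# Example values: a 4096-token context, sharded across 2 GPUs.
docker run \
  -e N_CTX=4096 \
  -e NUM_GPU_SHARD=2 \
  -v $(pwd)/test_input.json:/test_input.json \
  runpod-images:dev
```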

The `input` key of the POST request body should have the same structure as the `input` key in the test file:

```json
{
  "prompt": "Q: What is Python?\nA:",
  "stop": ["Q:", "\n"],
  "max_tokens": 50
}
```