Run LLMs on a distributed GPU architecture
Generate text with distributed Llama 3, Falcon (40B+), BLOOM (176B), or their derivatives, and fine-tune them for your own tasks, right from your desktop computer or Google Colab:
🐧 Linux + Anaconda. Run these commands for NVIDIA GPUs (or follow this for AMD):
conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install git+https://github.com/bigscience-workshop/petals
python -m petals.cli.run_server petals-team/StableBeluga2
🪟 Windows + WSL. Follow this guide on our Wiki.
🐋 Docker. Run our Docker image for NVIDIA GPUs (or follow this for AMD):
sudo docker run -p 31330:31330 --ipc host --gpus all --volume petals-cache:/cache --rm \
learningathome/petals:main \
python -m petals.cli.run_server --port 31330 petals-team/StableBeluga2
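If you launch the server from a script, or want to change the port or model in one place, the Docker invocation above can be assembled programmatically. The following is a minimal sketch: `build_server_command` and its parameter names are hypothetical helpers introduced here for illustration, not part of Petals. It only reproduces the flags shown in the command above.

```python
import shlex


def build_server_command(
    model_name,
    port=31330,
    cache_volume="petals-cache",
    image="learningathome/petals:main",
):
    """Return the Docker command shown above as an argument list.

    Hypothetical helper: it only rearranges the flags from the README
    command and adds no options of its own.
    """
    return [
        "sudo", "docker", "run",
        "-p", f"{port}:{port}",                # publish the server port on the host
        "--ipc", "host",
        "--gpus", "all",
        "--volume", f"{cache_volume}:/cache",  # named volume mounted at /cache
        "--rm",
        image,
        "python", "-m", "petals.cli.run_server",
        "--port", str(port),
        model_name,
    ]


command = build_server_command("petals-team/StableBeluga2")
# Print a copy-pasteable shell line; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Keeping the command as a list avoids shell-quoting mistakes when the model name or volume name comes from user input.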
🍏 macOS + Apple M1/M2 GPU. Install Homebrew, then run these commands:
brew install python
python3 -m pip install git+https://github.com/bigscience-workshop/petals
python3 -m petals.cli.run_server petals-team/StableBeluga2
To generate text, connect to the distributed network as a client from Python:
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM
# Choose any model available at https://health.petals.dev
model_name = "petals-team/StableBeluga2" # This one is fine-tuned Llama 2 (70B)
# Connect to a distributed network hosting model layers
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
# Run the model as if it were on your computer
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0])) # A cat sat on a mat...
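The snippet above can be wrapped into reusable functions when you need to run several prompts. This is a sketch only: `complete` and `load_distributed` are hypothetical names introduced here, and they use nothing beyond the tokenizer and model calls already shown. The Petals imports are deferred into `load_distributed` so the module can be imported on a machine where Petals is not installed.

```python
def complete(prompt, tokenizer, model, max_new_tokens=5):
    """Return the decoded continuation of `prompt`.

    Uses the same three steps as the snippet above:
    tokenize, generate, decode.
    """
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])


def load_distributed(model_name="petals-team/StableBeluga2"):
    """Load the tokenizer and connect to the swarm hosting `model_name`."""
    # Imported lazily: these packages are only needed when actually connecting.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model
```

With Petals installed and the swarm reachable, `tokenizer, model = load_distributed()` followed by `print(complete("A cat sat", tokenizer, model))` should behave like the original snippet. Passing the tokenizer and model in as arguments also lets you reuse one connection across many prompts.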
- You load a small part of the model and host a private swarm; others can then join the network
How to use:
- Getting started: tutorial
Useful tools:
- Chatbot web app (connects to Petals via an HTTP/WebSocket endpoint): source code
- Monitor for the public swarm: source code
Advanced guides:
How our results currently compare with other state-of-the-art (SOTA) models: