native support for modal cloud from CLI (#2237)
* native support for modal cloud from CLI

* do lm_eval in cloud too

* Fix the sub call to lm-eval

* lm_eval option to not post eval, and append not extend

* cache bust when using branch, grab sha of latest image tag, update lm-eval dep

* allow minimal yaml for lm eval

* include modal in requirements

* update link in README to include utm

* pr feedback

* use chat template

* revision support

* apply chat template as arg

* add wandb name support, allow explicit a100-40gb

* cloud is optional

* handle accidental setting of tasks with a single task str

* document the modal cloud yaml for clarity [skip ci]

* cli docs

* support spawn vs remote for lm-eval

* Add support for additional docker commands in modal image build

* cloud config shouldn't be a dir

* Update README.md

Co-authored-by: Charles Frye <[email protected]>

* fix annotation args

---------

Co-authored-by: Charles Frye <[email protected]>
winglian and charlesfrye authored Jan 30, 2025
1 parent 268543a commit 8779997
Showing 12 changed files with 835 additions and 54 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -217,7 +217,7 @@ If you love axolotl, consider sponsoring the project by reaching out directly to

---

- - [Modal](https://modal.com/) Modal lets you run data/AI jobs in the cloud, by just writing a few lines of Python. Customers use Modal to deploy Gen AI models at large scale, fine-tune LLM models, run protein folding simulations, and much more.
+ - [Modal](https://www.modal.com?utm_source=github&utm_medium=github&utm_campaign=axolotl) Modal lets you run data/AI jobs in the cloud, by just writing a few lines of Python. Customers use Modal to deploy Gen AI models at large scale, fine-tune large language models, run protein folding simulations, and much more.

---

256 changes: 256 additions & 0 deletions docs/cli.qmd
@@ -0,0 +1,256 @@
# Axolotl CLI Documentation

The Axolotl CLI provides a streamlined interface for training and fine-tuning large language models. This guide covers
the CLI commands, their usage, and common examples.

### Table of Contents

- Basic Commands
- Command Reference
  - fetch
  - preprocess
  - train
  - inference
  - merge-lora
  - merge-sharded-fsdp-weights
  - evaluate
  - lm-eval
- Legacy CLI Usage
- Remote Compute with Modal Cloud
  - Cloud Configuration
  - Running on Modal Cloud
  - Cloud Configuration Options


### Basic Commands

All Axolotl commands follow this general structure:

```bash
axolotl <command> [config.yml] [options]
```

The config file can be local or a URL to a raw YAML file.
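
For example, any command can consume a config straight from a raw GitHub URL (illustrative path; substitute your own):

```bash
# Train directly from a remote config hosted on GitHub
axolotl train https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/examples/openllama-3b/lora.yml
```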

### Command Reference

#### fetch

Downloads example configurations and deepspeed configs to your local machine.

```bash
# Get example YAML files
axolotl fetch examples

# Get deepspeed config files
axolotl fetch deepspeed_configs

# Specify custom destination
axolotl fetch examples --dest path/to/folder
```

#### preprocess

Preprocesses and tokenizes your dataset before training. This is recommended for large datasets.

```bash
# Basic preprocessing
axolotl preprocess config.yml

# Preprocessing with one GPU
CUDA_VISIBLE_DEVICES="0" axolotl preprocess config.yml

# Debug mode to see processed examples
axolotl preprocess config.yml --debug

# Debug with limited examples
axolotl preprocess config.yml --debug --debug-num-examples 5
```

Configuration options:

```yaml
dataset_prepared_path: Local folder for saving preprocessed data
push_dataset_to_hub: HuggingFace repo to push preprocessed data (optional)
```
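
For instance, a concrete (illustrative) setup might be:

```yaml
dataset_prepared_path: ./last_run_prepared   # local cache for tokenized data
push_dataset_to_hub: your-org/your-dataset   # optional; omit to keep data local
```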
#### train
Trains or fine-tunes a model using the configuration specified in your YAML file.
```bash
# Basic training
axolotl train config.yml

# Train and set/override specific options
axolotl train config.yml \
--learning-rate 1e-4 \
--micro-batch-size 2 \
--num-epochs 3

# Training without accelerate
axolotl train config.yml --no-accelerate

# Resume training from checkpoint
axolotl train config.yml --resume-from-checkpoint path/to/checkpoint
```

#### inference

Runs inference using your trained model in either CLI or Gradio interface mode.

```bash
# CLI inference with LoRA
axolotl inference config.yml --lora-model-dir="./outputs/lora-out"

# CLI inference with full model
axolotl inference config.yml --base-model="./completed-model"

# Gradio web interface
axolotl inference config.yml --gradio \
--lora-model-dir="./outputs/lora-out"

# Inference with input from file
cat prompt.txt | axolotl inference config.yml \
--base-model="./completed-model"
```

#### merge-lora

Merges trained LoRA adapters into the base model.

```bash
# Basic merge
axolotl merge-lora config.yml

# Specify LoRA directory (usually used with checkpoints)
axolotl merge-lora config.yml --lora-model-dir="./lora-output/checkpoint-100"

# Merge using CPU (if out of GPU memory)
CUDA_VISIBLE_DEVICES="" axolotl merge-lora config.yml
```

Configuration options:

```yaml
gpu_memory_limit: Limit GPU memory usage
lora_on_cpu: Load LoRA weights on CPU
```
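
For example (illustrative values; tune to your hardware):

```yaml
gpu_memory_limit: 20GiB  # cap GPU memory used while loading the model
lora_on_cpu: true        # keep LoRA weights on CPU if the merge runs out of VRAM
```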
#### merge-sharded-fsdp-weights
Merges sharded FSDP model checkpoints into a single combined checkpoint.
```bash
# Basic merge
axolotl merge-sharded-fsdp-weights config.yml
```

#### evaluate

Evaluates a model's performance using metrics specified in the config.

```bash
# Basic evaluation
axolotl evaluate config.yml
```

#### lm-eval

Runs the LM Evaluation Harness on your model.

```bash
# Basic evaluation
axolotl lm-eval config.yml

# Evaluate specific tasks
axolotl lm-eval config.yml --tasks arc_challenge,hellaswag
```

Configuration options:

```yaml
lm_eval_tasks: List of tasks to evaluate
lm_eval_batch_size: Batch size for evaluation
output_dir: Directory to save evaluation results
```
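
A concrete (illustrative) configuration:

```yaml
lm_eval_tasks:
  - arc_challenge
  - hellaswag
lm_eval_batch_size: 8
output_dir: ./outputs/lm-eval
```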
### Legacy CLI Usage
While the new Click-based CLI is preferred, Axolotl still supports the legacy module-based CLI:
```bash
# Preprocess
python -m axolotl.cli.preprocess config.yml

# Train
accelerate launch -m axolotl.cli.train config.yml

# Inference
accelerate launch -m axolotl.cli.inference config.yml \
--lora_model_dir="./outputs/lora-out"

# Gradio interface
accelerate launch -m axolotl.cli.inference config.yml \
--lora_model_dir="./outputs/lora-out" --gradio
```

### Remote Compute with Modal Cloud

Axolotl supports running training and inference workloads on Modal cloud infrastructure. This is configured using a
cloud YAML file alongside your regular Axolotl config.
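
If you have not used Modal before, you will also need local credentials. A minimal one-time setup with Modal's own CLI (which axolotl installs as a dependency) looks like:

```bash
pip install modal  # already pinned in axolotl's requirements.txt
modal setup        # authenticate in the browser and store a local token
```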

#### Cloud Configuration

Create a cloud config YAML with your Modal settings:

```yaml
# cloud_config.yml
provider: modal
gpu: a100 # Supported: l40s, a100-40gb, a100-80gb, a10g, h100, t4, l4
gpu_count: 1 # Number of GPUs to use
timeout: 86400 # Maximum runtime in seconds (24 hours)
branch: main # Git branch to use (optional)

volumes: # Persistent storage volumes
- name: axolotl-cache
mount: /workspace/cache

env: # Environment variables
- WANDB_API_KEY
- HF_TOKEN
```
#### Running on Modal Cloud
The following commands support the `--cloud` flag:
```bash
# Preprocess on cloud
axolotl preprocess config.yml --cloud cloud_config.yml

# Train on cloud
axolotl train config.yml --cloud cloud_config.yml

# Train without accelerate on cloud
axolotl train config.yml --cloud cloud_config.yml --no-accelerate

# Run lm-eval on cloud
axolotl lm-eval config.yml --cloud cloud_config.yml
```

#### Cloud Configuration Options

```yaml
provider: compute provider, currently only `modal` is supported
gpu: GPU type to use
gpu_count: Number of GPUs (default: 1)
memory: RAM in GB (default: 128)
timeout: Maximum runtime in seconds
timeout_preprocess: Preprocessing timeout
branch: Git branch to use
docker_tag: Custom Docker image tag
volumes: List of persistent storage volumes
env: Environment variables to pass
secrets: Secrets to inject
```
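
A complete, annotated cloud config ships with the repo as `examples/cloud/modal.yaml` (added in this commit; see the file below).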
28 changes: 28 additions & 0 deletions examples/cloud/modal.yaml
@@ -0,0 +1,28 @@
project_name:
volumes:
- name: axolotl-data
mount: /workspace/data
- name: axolotl-artifacts
mount: /workspace/artifacts

# Local environment variables to pass through as Modal secrets
secrets:
- HF_TOKEN
- WANDB_API_KEY

# Which branch of axolotl to use remotely
branch:

# Additional Dockerfile commands to run when building the image
dockerfile_commands:

gpu: h100
gpu_count: 1

# Train-specific configuration
memory: 128
timeout: 86400

# Preprocess-specific configuration
memory_preprocess: 32
timeout_preprocess: 14400
1 change: 1 addition & 0 deletions requirements.txt
@@ -25,6 +25,7 @@ hf_transfer
sentencepiece
gradio==3.50.2

+ modal==0.70.5
pydantic==2.6.3
addict
fire
Expand Down
17 changes: 11 additions & 6 deletions scripts/motd
@@ -1,10 +1,15 @@

dP dP dP
88 88 88
.d8888b. dP. .dP .d8888b. 88 .d8888b. d8888P 88
88' `88 `8bd8' 88' `88 88 88' `88 88 88
88. .88 .d88b. 88. .88 88 88. .88 88 88
`88888P8 dP' `dP `88888P' dP `88888P' dP dP
[new axolotl ASCII-art logo replaces the banner above; artwork not reproduced here]

Welcome to the axolotl cloud image! If you've mounted a disk to /workspace and the axolotl directory is empty, run the following commands:

56 changes: 56 additions & 0 deletions src/axolotl/cli/cloud/__init__.py
@@ -0,0 +1,56 @@
"""
launch axolotl in supported cloud platforms
"""
from pathlib import Path
from typing import Union

import yaml

from axolotl.cli.art import print_axolotl_text_art
from axolotl.cli.cloud.modal_ import ModalCloud
from axolotl.utils.dict import DictDefault


def load_cloud_cfg(cloud_config: Union[Path, str]) -> DictDefault:
"""Load and validate cloud configuration."""
# Load cloud configuration.
with open(cloud_config, encoding="utf-8") as file:
cloud_cfg: DictDefault = DictDefault(yaml.safe_load(file))
return cloud_cfg


def do_cli_preprocess(
cloud_config: Union[Path, str],
config: Union[Path, str],
) -> None:
print_axolotl_text_art()
cloud_cfg = load_cloud_cfg(cloud_config)
cloud = ModalCloud(cloud_cfg)
with open(config, "r", encoding="utf-8") as file:
config_yaml = file.read()
cloud.preprocess(config_yaml)


def do_cli_train(
cloud_config: Union[Path, str],
config: Union[Path, str],
accelerate: bool = True,
) -> None:
print_axolotl_text_art()
cloud_cfg = load_cloud_cfg(cloud_config)
cloud = ModalCloud(cloud_cfg)
with open(config, "r", encoding="utf-8") as file:
config_yaml = file.read()
cloud.train(config_yaml, accelerate=accelerate)


def do_cli_lm_eval(
cloud_config: Union[Path, str],
config: Union[Path, str],
) -> None:
print_axolotl_text_art()
cloud_cfg = load_cloud_cfg(cloud_config)
cloud = ModalCloud(cloud_cfg)
with open(config, "r", encoding="utf-8") as file:
config_yaml = file.read()
cloud.lm_eval(config_yaml)
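
As a rough sketch of how these helpers can be consumed, here is hypothetical Click glue for illustration only; the actual command definitions live elsewhere in axolotl and may differ in names and options:

```python
# Hypothetical example: expose do_cli_train as a `train --cloud ...` command.
# Names and options here are illustrative, not part of this diff.
import click

from axolotl.cli.cloud import do_cli_train


@click.command()
@click.argument("config", type=click.Path(exists=True))
@click.option("--cloud", "cloud_config", type=click.Path(exists=True), required=True)
@click.option("--accelerate/--no-accelerate", default=True)
def train(config: str, cloud_config: str, accelerate: bool) -> None:
    """Launch a training run on the configured cloud provider."""
    do_cli_train(cloud_config=cloud_config, config=config, accelerate=accelerate)


if __name__ == "__main__":
    train()  # pylint: disable=no-value-for-parameter
```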
