![Release](https://img.shields.io/github/v/release/instructlab/training)
![License](https://img.shields.io/github/license/instructlab/training)

To simplify the process of fine-tuning models with the [LAB
method](https://arxiv.org/abs/2403.01081), this library provides a simple training interface.

- [Installing the library](#installing-the-library)
  - [Additional NVIDIA packages](#additional-nvidia-packages)
- [Using the library](#using-the-library)
- [Learning about training arguments](#learning-about-training-arguments)
  - [`TrainingArgs`](#trainingargs)
  - [`DeepSpeedOptions`](#deepspeedoptions)
  - [`loraOptions`](#loraoptions)
- [Learning about `TorchrunArgs` arguments](#learning-about-torchrunargs-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Installing the library

To get started with the library, install it via `pip`:

```bash
pip install instructlab-training
```

For development, clone this repository and install it in editable mode so that local changes are picked up without reinstalling:

```bash
# clone the repository, then install it in editable (development) mode
git clone https://github.com/instructlab/training
pip install -e ./training
```

### Additional NVIDIA packages

This library uses the `flash-attn` package, along with other packages that rely on NVIDIA-specific CUDA tooling.
If you are using NVIDIA hardware with CUDA, install the following additional dependencies.

Basic install

```bash
# for a regular install
pip install .[cuda]
```

Editable install (development)

```bash
pip install -e .[cuda]
```
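
After installing, you can optionally verify that the package imports cleanly. This quick sanity check is a convenience, not part of the official instructions:

```bash
python -c "import instructlab.training; print('instructlab-training imported successfully')"
```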

## Using the library

You can use this training library by importing the necessary items:

```py
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
)
```

You can then define various training arguments that will serve as the parameters for your training runs. For details, see:

- [Learning about training arguments](#learning-about-training-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Learning about training arguments

The `TrainingArgs` class provides most of the customization options
for training jobs. There are a number of options you can specify, such as setting
`DeepSpeed` config values or running a `LoRA` training job instead of a full fine-tune.

### `TrainingArgs`

Here is a breakdown of the general options:

| Field | Description |
| --- | --- |
| model_path | The base model to train: either a Hugging Face reference or a local path (for example, `ibm-granite/granite-7b-base`). |
| data_path | Path to the training dataset in `.jsonl` format. |
| ckpt_output_dir | Directory where training checkpoints are saved. |
| data_output_dir | Directory where the processed training data is written. |
| max_seq_len | Maximum sequence length (in tokens) per training sample. |
| max_batch_len | Maximum number of tokens per batch. |
| num_epochs | Number of epochs to train for. |
| effective_batch_size | Effective batch size (in samples) per optimizer step. |
| save_samples | Number of samples processed between checkpoint saves. |
| learning_rate | Learning rate for the optimizer. |
| warmup_steps | Number of learning-rate warmup steps. |
| is_padding_free | Whether the model being trained is a padding-free model (set to `True` for Granite-based models). |
| random_seed | Random seed used for training. |
| deepspeed_options | Config options to specify for the DeepSpeed optimizer. |
| lora | Options to specify if you intend to perform a LoRA train instead of a full fine-tune. |

### `DeepSpeedOptions`

This library currently supports only a few options in `DeepSpeedOptions`.
The default is to run with DeepSpeed, so these options currently only
allow you to customize aspects of the ZeRO stage 2 optimizer.

| Field | Description |
| --- | --- |
| cpu_offload_optimizer | Whether or not to do CPU offloading in DeepSpeed stage 2. |
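
As an illustration, here is a minimal sketch of enabling CPU offloading through `TrainingArgs`. It assumes `DeepSpeedOptions` is importable from `instructlab.training` alongside the other classes, and it reuses the values from the example training run below:

```py
from instructlab.training import DeepSpeedOptions, TrainingArgs

training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True,
    random_seed = 42,
    # offload optimizer state to the CPU to reduce GPU memory pressure (ZeRO stage 2)
    deepspeed_options = DeepSpeedOptions(cpu_offload_optimizer = True),
)
```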

For more information about DeepSpeed, see [deepspeed.ai](https://www.deepspeed.ai/).

### `loraOptions`

LoRA options currently supported:

| Field | Description |
| --- | --- |
| rank | The rank parameter for LoRA training. |
| alpha | The alpha parameter for LoRA training. |
| dropout | The dropout rate for LoRA training. |
| target_modules | The list of target modules for LoRA training. |
| quantize_data_type | The data type for quantization in LoRA training. Valid options are `None` and `"nf4"`. |

#### Example run with LoRA options

If you'd like to do a LoRA train, you can specify a LoRA
option to `TrainingArgs` via the `LoraOptions` object.
```py
from instructlab.training import LoraOptions, TrainingArgs

# the LoRA values below are illustrative, not prescriptive
training_args = TrainingArgs(
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # ... plus the other TrainingArgs fields shown in the example training run below
)
```
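
If you also want to quantize the base model during a LoRA run, the `quantize_data_type` field from the table above can be set on the same object. A small sketch with illustrative values, assuming `LoraOptions` is imported as above:

```py
# illustrative: LoRA options with 4-bit (nf4) quantization enabled
lora_options = LoraOptions(
    rank = 4,
    alpha = 32,
    dropout = 0.1,
    quantize_data_type = "nf4",
)
```

Pass this object to `TrainingArgs(lora = lora_options, ...)` exactly as in the example above.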

## Learning about `TorchrunArgs` arguments

When running the training script, we always invoke `torchrun`.

If you are running a single-GPU system or something that doesn't
otherwise require a distributed training configuration, you can create a default object:

```python
run_training(
    # a minimal single-node, single-GPU sketch; the fields mirror the TorchrunArgs example below
    torchrun_args=TorchrunArgs(
        nnodes = 1,           # number of machines
        nproc_per_node = 1,   # number of GPUs per machine
        node_rank = 0,        # rank of this machine
        rdzv_id = 123,
        rdzv_endpoint = "127.0.0.1:12345",
    ),
    training_args=training_args,
)

However, if you want to specify a more complex configuration,
the library currently supports all of the options that [torchrun accepts
today](https://pytorch.org/docs/stable/elastic/run.html#definitions).

> [!NOTE]
> For more information about the `torchrun` arguments, please consult the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html#definitions).

### Example training run with `TorchrunArgs` arguments

For example, on an 8-GPU, 2-machine system, you would specify
the following torchrun configuration:

```python
# a sketch for the first machine (node rank 0); the endpoint address is illustrative
torchrun_args = TorchrunArgs(
    nnodes = 2,              # number of machines
    nproc_per_node = 8,      # number of GPUs per machine
    node_rank = 0,           # rank of this machine; use 1 on the second machine
    rdzv_id = 123,
    rdzv_endpoint = '10.0.0.1:12345',  # illustrative address of the rendezvous (head) node
)

run_training(
    torch_args=torchrun_args,
    train_args=training_args
)
```
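
On the second machine, the same configuration would be launched with only the node rank changed; `run_training` is then invoked on each machine with its own `TorchrunArgs`. A sketch following the fields documented above:

```python
# illustrative: the second machine uses node rank 1 but otherwise identical settings
torchrun_args = TorchrunArgs(
    nnodes = 2,
    nproc_per_node = 8,
    node_rank = 1,  # this machine's rank
    rdzv_id = 123,
    rdzv_endpoint = '10.0.0.1:12345',  # same illustrative head-node address as above
)
```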

## Example training run with arguments

Define the training arguments which will serve as the
parameters for our training run:

```py
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,
)
```

We'll also need to define the settings for running a multi-process job
via `torchrun`. To do this, create a `TorchrunArgs` object.

> [!TIP]
> Note, for single-GPU jobs, you can simply set `nnodes = 1` and `nproc_per_node=1`.

```py
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
```

Finally, you can just call `run_training` and this library will handle
the rest 🙂.

```py
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
```
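
After the run finishes, checkpoints are written under the configured `ckpt_output_dir` (`data/saved_checkpoints` in this example). As a small, optional sketch, you could list them like so:

```py
import os

# list the checkpoints written by the run configured above
ckpt_dir = "data/saved_checkpoints"  # same path as ckpt_output_dir in TrainingArgs
print(os.listdir(ckpt_dir))
```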
