Docs: Update training README #218

Merged 1 commit on Sep 26, 2024

README.md: 206 changes (114 additions & 92 deletions)

![Release](https://img.shields.io/github/v/release/instructlab/training)
![License](https://img.shields.io/github/license/instructlab/training)

To simplify the process of fine-tuning models with the [LAB
method](https://arxiv.org/abs/2403.01081), this library provides a simple training interface.

- [Installing the library](#installing-the-library)
- [Additional Nvidia packages](#additional-nvidia-packages)
- [Using the library](#using-the-library)
- [Learning about the training arguments](#learning-about-training-arguments)
- [`TrainingArgs`](#trainingargs)
- [`DeepSpeedOptions`](#deepspeedoptions)
- [`FSDPOptions`](#fsdpoptions)
- [`loraOptions`](#loraoptions)
- [Learning about `TorchrunArgs` arguments](#learning-about-torchrunargs-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Installing the library

To get started with the library, install it via `pip`:

```bash
pip install instructlab-training
```

For development, clone the repository and install it in editable mode so that local
changes take effect while you use the library elsewhere:
```bash
git clone https://github.com/instructlab/training
pip install -e ./training
```

### Additional NVIDIA packages

This library uses the `flash-attn` package, along with other packages that rely on NVIDIA-specific CUDA tooling being installed.
If you are using NVIDIA hardware with CUDA, you need to install the following additional dependencies.

Basic install:

```bash
pip install .[cuda]
```

Editable install (development):

```bash
pip install -e .[cuda]
```
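
After installing the CUDA extras, a quick sanity check (an optional sketch, not a documented
step of this library) is to confirm that PyTorch can see a GPU and that `flash_attn` imports:

```py
# Optional: verify the CUDA toolchain pieces the library relies on.
import torch
print(torch.cuda.is_available())  # expect True on a working CUDA setup

import flash_attn
print(flash_attn.__version__)     # confirms flash-attn was installed correctly
```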

## Using the library

To use the library, start by importing the necessary items:

```py
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
)
```

You can then define the training arguments that will serve as the parameters for your training run. See:

- [Learning about the training arguments](#learning-about-training-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)


## Learning about training arguments

The `TrainingArgs` class provides most of the customization options
for training jobs. There are a number of options you can specify, such as setting
`DeepSpeed` config values or running a `LoRA` training job instead of a full fine-tune.

### `TrainingArgs`

Here is a breakdown of the general options:

| Field | Description |
| --- | --- |
| distributed_backend | Specifies which distributed training backend to use. Supported options are "fsdp" and "deepspeed". |
| disable_flash_attn | Disables flash attention when set to true. This allows for training on older devices. |
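
For instance, these two options can be set directly when constructing `TrainingArgs`.
This is only a sketch: the other fields shown come from the example run later in this
document, and additional required fields may apply.

```py
from instructlab.training import TrainingArgs

# Sketch: pick the distributed backend and keep flash attention enabled.
training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    distributed_backend = "fsdp",  # or "deepspeed"
    disable_flash_attn = False,    # set to True for older devices without flash-attn support
    # plus any other required fields shown in the example training run below
)
```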

### `DeepSpeedOptions`

This library currently supports only a few options in `DeepSpeedOptions`.
The default is to run with DeepSpeed, so these options currently only
allow you to customize aspects of the ZeRO stage 2 optimizer.

| Field | Description |
| --- | --- |
| cpu_offload_optimizer_pin_memory | If true, offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. |
| save_samples | The number of samples to see before saving a DeepSpeed checkpoint. |

For more information about DeepSpeed, see [deepspeed.ai](https://www.deepspeed.ai/).
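
As a brief illustration, here is a sketch of how these options might be passed to a
training run. The `deepspeed_options` field name and the `cpu_offload_optimizer` flag
are assumptions made for this example; check the library's type definitions for the
exact API.

```py
from instructlab.training import DeepSpeedOptions, TrainingArgs

# Sketch: offload the ZeRO stage 2 optimizer state to pinned CPU memory.
training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    deepspeed_options = DeepSpeedOptions(
        cpu_offload_optimizer = True,             # assumed flag enabling optimizer offload
        cpu_offload_optimizer_pin_memory = True,  # offload to page-locked CPU memory
        save_samples = 250000,
    ),
    # plus any other required fields shown in the example training run below
)
```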

### `FSDPOptions`

Like DeepSpeed, we only expose a number of parameters for you to modify with FSDP.

> [!NOTE]
> For `sharding_strategy` - Only `SHARD_GRAD_OP` has been extensively tested and is actively supported by this library.
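
As with DeepSpeed, a sketch of selecting FSDP and setting a sharding strategy might look
like the following. The `fsdp_options` field name, the `FSDPOptions` import, and the form
of the `sharding_strategy` value are assumptions for illustration; consult the library's
type definitions for the exact API.

```py
from instructlab.training import FSDPOptions, TrainingArgs

# Sketch: train with the FSDP backend using the SHARD_GRAD_OP strategy noted above.
training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    distributed_backend = "fsdp",
    fsdp_options = FSDPOptions(
        sharding_strategy = "SHARD_GRAD_OP",  # may be an enum value in practice
    ),
    # plus any other required fields shown in the example training run below
)
```
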
### `loraOptions`

LoRA options currently supported:

| Field | Description |
| --- | --- |
| rank | The rank parameter for LoRA training. |
| alpha | The alpha parameter for LoRA training. |
| dropout | The dropout rate for LoRA training. |
| target_modules | The list of target modules for LoRA training. |
| quantize_data_type | The data type for quantization in LoRA training. Valid options are `None` and `"nf4"` |

#### Example run with LoRA options

If you'd like to do a LoRA training run, you can specify LoRA options to `TrainingArgs`
via the `LoraOptions` object.
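
A minimal sketch, assuming the `TrainingArgs` field is named `lora` (the values shown are
illustrative, not recommendations):

```py
from instructlab.training import LoraOptions, TrainingArgs

# Sketch: attach LoRA settings to an otherwise normal TrainingArgs object.
training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # plus any other required fields shown in the example training run below
)
```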


### Learning about `TorchrunArgs` arguments

When running the training script, we always invoke `torchrun`.

If you are running a single-GPU system or something that doesn't
otherwise require a distributed training configuration, you can create a default object:

```python
run_training(
    torchrun_args=TorchrunArgs(),  # defaults suit a single-node, single-process run
    training_args=training_args,   # a TrainingArgs object such as the one defined below
)
```

However, if you want to specify a more complex configuration,
the library currently supports all of the options that [torchrun accepts
today](https://pytorch.org/docs/stable/elastic/run.html#definitions).

> [!NOTE]
> For more information about the `torchrun` arguments, please consult the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html#definitions).

#### Example training run with `TorchrunArgs` arguments

For example, in an 8-GPU, 2-machine system, we would
specify a torchrun config along the following lines (the exact values are a
representative sketch; set `node_rank` appropriately on each machine):

```py
torchrun_args = TorchrunArgs(
    nnodes = 2,         # number of machines
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 0,      # rank of this machine
    rdzv_id = 123,
    rdzv_endpoint = '<node-0-host>:12345',  # rendezvous endpoint reachable from both machines
)

run_training(
    torch_args=torchrun_args,
    train_args=training_args
)
```
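
On the second machine, you would launch the same script with the node rank changed. A
sketch, mirroring the values assumed above:

```py
# Same configuration as node 0, except for the node rank.
torchrun_args = TorchrunArgs(
    nnodes = 2,
    nproc_per_node = 4,
    node_rank = 1,  # rank of this machine within the 2-node job
    rdzv_id = 123,
    rdzv_endpoint = '<node-0-host>:12345',  # same rendezvous endpoint as node 0
)
```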

## Example training run with arguments

Define the training arguments which will serve as the
parameters for our training run:

```py
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,
)
```

We'll also need to define the settings for running a multi-process job
via `torchrun`. To do this, create a `TorchrunArgs` object.

> [!TIP]
> Note, for single-GPU jobs, you can simply set `nnodes = 1` and `nproc_per_node=1`.

```py
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
```

Finally, you can just call `run_training` and this library will handle
the rest 🙂.

```py
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
```