Docs: Update training README #218

Merged 1 commit on Sep 26, 2024

README.md: 206 changes (114 additions & 92 deletions)

![Release](https://img.shields.io/github/v/release/instructlab/training)
![License](https://img.shields.io/github/license/instructlab/training)

To simplify the process of fine-tuning models with the [LAB
method](https://arxiv.org/abs/2403.01081), this library provides a simple training interface.

- [Installing the library](#installing-the-library)
- [Additional Nvidia packages](#additional-nvidia-packages)
- [Using the library](#using-the-library)
- [Learning about the training arguments](#learning-about-training-arguments)
- [`TrainingArgs`](#trainingargs)
- [`DeepSpeedOptions`](#deepspeedoptions)
- [`FSDPOptions`](#fsdpoptions)
- [`loraOptions`](#loraoptions)
- [Learning about `TorchrunArgs` arguments](#learning-about-torchrunargs-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Installing the library

To get started with the library, install it via `pip`:

```bash
pip install instructlab-training
```

For development, clone the repository and install it in editable mode so that local
changes take effect while you use the library elsewhere:
```bash
git clone https://github.com/instructlab/training
pip install -e ./training
```

### Additional NVIDIA packages

This library uses the `flash-attn` package, along with other packages that rely on NVIDIA-specific CUDA tooling being installed.
If you are using NVIDIA hardware with CUDA, you need to install the following additional dependencies.

Basic install:

```bash
pip install .[cuda]
```

Editable install (development):

```bash
pip install -e .[cuda]
```
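
After installing the CUDA extras, a quick sanity check (an optional sketch, not a documented
step of this library) is to confirm that PyTorch can see a GPU and that `flash_attn` imports:

```py
# Optional: verify the CUDA toolchain pieces the library relies on.
import torch
print(torch.cuda.is_available())  # expect True on a working CUDA setup

import flash_attn
print(flash_attn.__version__)     # confirms flash-attn was installed correctly
```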

## Using the library

To use the library, start by importing the necessary items:

```py
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
)
```

You can then define the training arguments that will serve as the parameters for your training run. See:

- [Learning about the training arguments](#learning-about-training-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)


## Learning about training arguments

The `TrainingArgs` class provides most of the customization options
for training jobs. There are a number of options you can specify, such as setting
`DeepSpeed` config values or running a `LoRA` training job instead of a full fine-tune.

### `TrainingArgs`

Here is a breakdown of the general options:

| Field | Description |
| --- | --- |
| distributed_backend | Specifies which distributed training backend to use. Supported options are "fsdp" and "deepspeed". |
| disable_flash_attn | Disables flash attention when set to true. This allows for training on older devices. |
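
For instance, these two options can be set directly when constructing `TrainingArgs`.
This is only a sketch: the other fields shown come from the example run later in this
document, and additional required fields may apply.

```py
from instructlab.training import TrainingArgs

# Sketch: pick the distributed backend and keep flash attention enabled.
training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    distributed_backend = "fsdp",  # or "deepspeed"
    disable_flash_attn = False,    # set to True for older devices without flash-attn support
    # plus any other required fields shown in the example training run below
)
```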

### `DeepSpeedOptions`

This library currently supports only a few options in `DeepSpeedOptions`.
The default is to run with DeepSpeed, so these options currently only
allow you to customize aspects of the ZeRO stage 2 optimizer.

| Field | Description |
| --- | --- |
| cpu_offload_optimizer_pin_memory | If true, offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. |
| save_samples | The number of samples to see before saving a DeepSpeed checkpoint. |

For more information about DeepSpeed, see [deepspeed.ai](https://www.deepspeed.ai/).
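
As a brief illustration, here is a sketch of how these options might be passed to a
training run. The `deepspeed_options` field name and the `cpu_offload_optimizer` flag
are assumptions made for this example; check the library's type definitions for the
exact API.

```py
from instructlab.training import DeepSpeedOptions, TrainingArgs

# Sketch: offload the ZeRO stage 2 optimizer state to pinned CPU memory.
training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    deepspeed_options = DeepSpeedOptions(
        cpu_offload_optimizer = True,             # assumed flag enabling optimizer offload
        cpu_offload_optimizer_pin_memory = True,  # offload to page-locked CPU memory
        save_samples = 250000,
    ),
    # plus any other required fields shown in the example training run below
)
```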

### `FSDPOptions`

Like DeepSpeed, we only expose a number of parameters for you to modify with FSDP.

> [!NOTE]
> For `sharding_strategy` - Only `SHARD_GRAD_OP` has been extensively tested and is actively supported by this library.
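
As with DeepSpeed, a sketch of selecting FSDP and setting a sharding strategy might look
like the following. The `fsdp_options` field name, the `FSDPOptions` import, and the form
of the `sharding_strategy` value are assumptions for illustration; consult the library's
type definitions for the exact API.

```py
from instructlab.training import FSDPOptions, TrainingArgs

# Sketch: train with the FSDP backend using the SHARD_GRAD_OP strategy noted above.
training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    distributed_backend = "fsdp",
    fsdp_options = FSDPOptions(
        sharding_strategy = "SHARD_GRAD_OP",  # may be an enum value in practice
    ),
    # plus any other required fields shown in the example training run below
)
```
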
### `loraOptions`

LoRA options currently supported:

| Field | Description |
| --- | --- |
| rank | The rank parameter for LoRA training. |
| alpha | The alpha parameter for LoRA training. |
| dropout | The dropout rate for LoRA training. |
| target_modules | The list of target modules for LoRA training. |
| quantize_data_type | The data type for quantization in LoRA training. Valid options are `None` and `"nf4"` |

#### Example run with LoRA options

If you'd like to do a LoRA training run, you can specify LoRA options to `TrainingArgs`
via the `LoraOptions` object.
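
A minimal sketch, assuming the `TrainingArgs` field is named `lora` (the values shown are
illustrative, not recommendations):

```py
from instructlab.training import LoraOptions, TrainingArgs

# Sketch: attach LoRA settings to an otherwise normal TrainingArgs object.
training_args = TrainingArgs(
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # plus any other required fields shown in the example training run below
)
```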


### Learning about `TorchrunArgs` arguments

When running the training script, we always invoke `torchrun`.

If you are running a single-GPU system or something that doesn't
otherwise require a distributed training configuration, you can create a default object:

```python
run_training(
    torchrun_args=TorchrunArgs(),  # defaults suit a single-node, single-process run
    training_args=training_args,   # a TrainingArgs object such as the one defined below
)
```

However, if you want to specify a more complex configuration,
the library currently supports all of the options that [torchrun accepts
today](https://pytorch.org/docs/stable/elastic/run.html#definitions).

> [!NOTE]
> For more information about the `torchrun` arguments, please consult the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html#definitions).

#### Example training run with `TorchrunArgs` arguments

For example, in an 8-GPU, 2-machine system, we would
specify a torchrun config along the following lines (the exact values are a
representative sketch; set `node_rank` appropriately on each machine):

```py
torchrun_args = TorchrunArgs(
    nnodes = 2,         # number of machines
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 0,      # rank of this machine
    rdzv_id = 123,
    rdzv_endpoint = '<node-0-host>:12345',  # rendezvous endpoint reachable from both machines
)

run_training(
    torch_args=torchrun_args,
    train_args=training_args
)
```
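
On the second machine, you would launch the same script with the node rank changed. A
sketch, mirroring the values assumed above:

```py
# Same configuration as node 0, except for the node rank.
torchrun_args = TorchrunArgs(
    nnodes = 2,
    nproc_per_node = 4,
    node_rank = 1,  # rank of this machine within the 2-node job
    rdzv_id = 123,
    rdzv_endpoint = '<node-0-host>:12345',  # same rendezvous endpoint as node 0
)
```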

## Example training run with arguments

Define the training arguments which will serve as the
parameters for our training run:

```py
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,
)
```

We'll also need to define the settings for running a multi-process job
via `torchrun`. To do this, create a `TorchrunArgs` object.

> [!TIP]
> Note, for single-GPU jobs, you can simply set `nnodes = 1` and `nproc_per_node=1`.

```py
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
```

Finally, you can just call `run_training` and this library will handle
the rest 🙂.

```py
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
```