

LoRA (#16)
Low-Rank Adaptation implementation
Andrei-Aksionov authored May 10, 2023
1 parent f3daa68 commit fb208fb
Showing 15 changed files with 703 additions and 17 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -1,23 +1,22 @@

<p>
<h2 align="center">Welcome to NanoGPT+ in PyTorch</h2>
<h5 align="center">Knock-off edition (but with enchantments)<h5>
<h2 align="center">Welcome to NanoGPT+</h2>
<h4 align="center">Knock-off edition</h4>
<h6 align="center">but with enchantments</h6>
</p>

[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)
[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-310/)
![Python versions](/assets/readme/python_versions.svg)
[![test](https://github.com/Andrei-Aksionov/nanoGPTplus/actions/workflows/test.yaml/badge.svg)](https://github.com/Andrei-Aksionov/nanoGPTplus/actions/workflows/test.yaml)

***

In this repository I want to rewrite the code for `nanoGPT` presented by Andrej Karpathy in [this video](https://www.youtube.com/watch?v=kCc8FmEb1nY). The original code is in a state suitable for rapid prototyping, while the code in this repository is, in my opinion, more mature (with docstrings, comments explaining what exactly is going on, a readme for the architecture, ...), hence the name nanoGPT+ (you can read it as a very, very small plus :laughing:)
In this repository I want to rewrite the code for `nanoGPT` presented by Andrej Karpathy in [this video](https://www.youtube.com/watch?v=kCc8FmEb1nY). The original code is in a state suitable for rapid prototyping, while the code in this repository is, in my opinion, more mature (with docstrings, comments explaining what exactly is going on, a readme for the architecture, key-value cache and Low-Rank Adaptation (LoRA) implementations, ...), hence the name nanoGPT+ (you can read it as a very, very small plus :laughing:)

The purpose of it is to better understand how Transformer architecture works by actually writing code and, if possible, making it better (or at least to make it work with as few issues as possible).

> **Note**: while the code in this repository reflects almost all of the logic of the original one, due to a lack of access to a GPU (let alone multiple GPUs or nodes with multiple GPUs) I haven't added GPU-specific code, so if you have one (a GPU or even a whole node) you should look at the [original repo](https://github.com/karpathy/nanoGPT).
<p align=center><img src="references/readme/amazon_prime.jpg"></p>
<p align=center><img src="assets/readme/amazon_prime.jpg"></p>

# Project structure

File renamed without changes
23 changes: 23 additions & 0 deletions assets/readme/python_versions.svg
File renamed without changes
18 changes: 18 additions & 0 deletions src/config/config.yaml
@@ -60,6 +60,12 @@ model:
# lr scheduler
warmup_iters: null
lr_decay_iters: null
# PEFT
# Note: for training from scratch this should be set to false
use_lora: false
lora_rank: 2
lora_alpha: 3
lora_dropout: 0.0
grad_accumulation_steps: 5
epochs: 1
tqdm_update_interval: 1
@@ -85,6 +91,12 @@ model:
# lr scheduler
warmup_iters: null
lr_decay_iters: null
# PEFT
# Note: for training from scratch this should be set to false
use_lora: false
lora_rank: 2
lora_alpha: 3
lora_dropout: 0.0
grad_accumulation_steps: null
epochs: 1
tqdm_update_interval: 1
@@ -110,6 +122,12 @@ model:
# lr scheduler
warmup_iters: null
lr_decay_iters: null
# PEFT
# Note: for training from scratch this should be set to false
use_lora: false
lora_rank: 2
lora_alpha: 3
lora_dropout: 0.0
grad_accumulation_steps: null
epochs: 1
tqdm_update_interval: 10
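
For illustration, the new PEFT options might be consumed roughly as sketched below; the flattened dictionary stands in for one model section of `config.yaml` (the real nesting may differ), and the `alpha / rank` scaling follows the usual LoRA convention rather than code from this commit:

```python
# a flattened stand-in for one model section of config.yaml (the real nesting may differ)
model_config = {
    "use_lora": True,  # keep false when training from scratch
    "lora_rank": 2,
    "lora_alpha": 3,
    "lora_dropout": 0.0,
}

if model_config["use_lora"]:
    # the low-rank update is usually scaled by alpha / rank
    scaling = model_config["lora_alpha"] / model_config["lora_rank"]
    print(f"LoRA enabled: rank={model_config['lora_rank']}, scaling={scaling}")
```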
6 changes: 3 additions & 3 deletions src/model/gpt_language_model/README.md
@@ -1,6 +1,6 @@
# Notes about Transformer architecture

![transformer architecture](../../../references/transformer/transformer_architecture.png)
![transformer architecture](../../../assets/transformer/transformer_architecture.png)

pic 1: Transformer architecture[^1] (encoder on the left, decoder on the right).

@@ -11,7 +11,7 @@ Decoder consists of transformer blocks and each transformer block consists of two
1. Self-attention layer.
2. Feed-forward layer.

![transformer block](../../../references/transformer/transformer_block.png)
![transformer block](../../../assets/transformer/transformer_block.png)

pic 2: Transformer block[^1].
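
For orientation, here is a minimal sketch of such a block in PyTorch; it uses the built-in `nn.MultiheadAttention` and omits the causal mask for brevity, whereas this repository implements attention itself in `attention.py`:

```python
import torch
from torch import nn


class Block(nn.Module):
    """Illustrative transformer block: self-attention followed by a feed-forward layer."""

    def __init__(self, embeddings_size: int, num_heads: int) -> None:
        super().__init__()
        self.ln_1 = nn.LayerNorm(embeddings_size)
        self.attention = nn.MultiheadAttention(embeddings_size, num_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(embeddings_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embeddings_size, 4 * embeddings_size),
            nn.GELU(),
            nn.Linear(4 * embeddings_size, embeddings_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pre-norm residual connections; causal masking is omitted here for brevity
        normed = self.ln_1(x)
        attention_out, _ = self.attention(normed, normed, normed, need_weights=False)
        x = x + attention_out
        return x + self.feed_forward(self.ln_2(x))


block = Block(embeddings_size=64, num_heads=4)
out = block(torch.randn(2, 8, 64))  # (batch size, sequence length, embeddings size)
```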

@@ -73,4 +73,4 @@ In order to build decoder one needs to have:
4. Final head fully-connected layer to transform final token embeddings into predictions.

[^1]: [Illustrated transformer](https://jalammar.github.io/illustrated-transformer/)
[^2]:[Andrej Karpaty's nanoGPT Google Colab](<https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=h5hjCcLDr2WC>)
[^2]:[Andrej Karpathy's nanoGPT Google Colab](<https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=h5hjCcLDr2WC>)
8 changes: 8 additions & 0 deletions src/model/gpt_language_model/attention.py
@@ -406,6 +406,14 @@ def forward(self, x: Tensor, kv_cache: Optional[Tensor]) -> Tensor:
# [0.1, 0.2, 0.3] -> [0.1, 0.2, 0.3]
# and after softmax `-inf` becomes 0
# this doesn't allow current token communicate with future ones

# TODO: do I correctly apply masking
# w : (batch, head, q_seq_length, kv_seq_length) # noqa: ERA001
# w = torch.matmul(q, k) # noqa: ERA001
# if self.scale:
# w = w / math.sqrt(v.size(-1)) # noqa: ERA001
# nd, ns = w.size(-2), w.size(-1)  # noqa: ERA001
# b = self.bias[:, :, ns-nd:ns, :ns]  # noqa: ERA001
attention_scores = attention_scores.masked_fill(self.tril[:T, :T] == 0, float("-inf")) # (B, nh, T, T)

# since we want to do weighted averaging we need to transform attention scores into range [0, 1]
6 changes: 5 additions & 1 deletion src/model/gpt_language_model/gpt.py
@@ -8,6 +8,7 @@
from torch import Tensor, nn
from tqdm import trange

from src.model.gpt_language_model.peft.lora import MergedLinear
from src.model.gpt_language_model.transformer_block import LayerNorm, TransformerBlock
from src.utils import log_error

@@ -118,6 +119,7 @@ def __init__(
)

def __get_parameters_number(self, exclude_positional_embeddings: bool = True) -> int:
# TODO: print total number of parameters and number of learnable parameters
"""Return total number of parameters of the model without counting parameters of positional embeddings."""
params_count = sum(param.numel() for param in self.parameters())
if exclude_positional_embeddings:
@@ -139,6 +141,8 @@ def __init_weights(self, module: torch.nn.modules) -> None:
module of the network
"""
if isinstance(module, (nn.Embedding, nn.Linear)):
# TODO: check if a different std init works better
# 0.02 / sqrt(2 * number of transformer blocks)
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if hasattr(module, "bias") and module.bias is not None:
torch.nn.init.zeros_(module.bias)
@@ -237,7 +241,7 @@ def __optimizer_parameters(self, weight_decay: float) -> Tuple[dict, dict]:
"""
# separate out all parameters to those that will and won't experience regularizing weight decay
decay, no_decay = set(), set()
expected_weight_modules = (nn.Linear, nn.LayerNorm, LayerNorm, nn.Embedding)
expected_weight_modules = (nn.Linear, nn.LayerNorm, LayerNorm, nn.Embedding, MergedLinear)
for pn, _ in self.named_parameters():
# get the parent module by the parameter's name
module = reduce(lambda module, key: getattr(module, key), pn.split(".")[:-1], self)
41 changes: 41 additions & 0 deletions src/model/gpt_language_model/peft/README.md
@@ -0,0 +1,41 @@
# Parameter-Efficient Finetuning for LLMs

## 1. Low-Rank Adaptation (LoRA)

![LoRA](https://lightningaidev.wpengine.com/wp-content/uploads/2023/04/lora-3-1024x742.png)
pic 1: LoRA architecture[^1]

Low-rank adaptation is a technique for parameter-efficient finetuning of large language models. Instead of updating all the weights during finetuning (which is very compute-intensive), we learn a separate matrix that stores the updates to the pretrained weights, and to reduce the number of trainable parameters we decompose this weight-update matrix into two matrices of a lower rank.

As an example: we can have pretrained weights of shape (d, d) and two matrices A and B of shape (d, r) and (r, d) respectively; the pretrained weights are not changed during training, only matrices A and B are.
As can be seen in the scheme above, we apply the pretrained weights to the input, apply the separate weight-update matrix formed by the A@B matrix multiplication as well, and then sum the two outputs.

If d=100 and r=1, then instead of updating 100x100=10_000 parameters (for the pretrained weights) we update only 100x1 (for matrix A) plus 1x100 (for matrix B), which makes 200 parameters in total instead of 10k.
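
To make the arithmetic above concrete, here is a minimal LoRA-style linear layer in PyTorch; it is an illustrative sketch (names and init values are assumptions), not the `MergedLinear` class used in this repository:

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Sketch of LoRA: y = pretrained(x) + (alpha / r) * x @ A @ B."""

    def __init__(self, d_in: int, d_out: int, rank: int, alpha: int = 3) -> None:
        super().__init__()
        self.pretrained = nn.Linear(d_in, d_out, bias=False)
        self.pretrained.weight.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) * 0.01)  # (d, r)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))  # (r, d): zeros, so the update starts at 0
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pretrained(x) + (x @ self.lora_A @ self.lora_B) * self.scaling


layer = LoRALinear(d_in=100, d_out=100, rank=1)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 200: only A (100x1) and B (1x100) are updated
```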

You can read more about it in:

1. Microsoft's LoRA
- [paper](https://arxiv.org/pdf/2106.09685.pdf)
- [repo](https://github.com/microsoft/LoRA/)
2. Lighting.ai
- [blogpost](https://lightning.ai/pages/community/tutorial/lora-llm/)
- [repo](https://github.com/Lightning-AI/lit-llama)

### -- About implementation --

If you take a look at Microsoft's LoRA repo, you may notice that there are two classes, `Linear` and `MergedLinear`, which can be confusing at first.

Basically, there are two approaches to calculating the query, key and value matrices:

In the basic implementation of the attention mechanism you have three separate weight matrices for query, key and
value, and thus in order to obtain q, k and v you apply these three matrices separately to the input x (in self-attention).
This approach is covered by the LoRA `Linear` class. It's not implemented in this repository, but you can find an implementation [here](https://github.com/microsoft/LoRA/blob/main/loralib/layers.py#L91).

The other approach is to have a single matrix that stores the weights for all three projections (query, key and value).
You can then apply this big combined matrix once (which helps with parallelization on a GPU) and split the
output into three chunks to obtain queries, keys and values. The `MergedLinear` class covers this approach.
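
A minimal sketch of this combined projection (illustrative names, not the exact code from `attention.py`):

```python
import torch
from torch import nn

batch_size, seq_len, embeddings_size = 2, 8, 64

# one weight matrix produces the query, key and value projections in a single matmul
qkv_projection = nn.Linear(embeddings_size, 3 * embeddings_size, bias=False)

x = torch.randn(batch_size, seq_len, embeddings_size)
qkv = qkv_projection(x)                       # (B, T, 3 * C)
q, k, v = qkv.split(embeddings_size, dim=-1)  # three chunks of shape (B, T, C)
```

With LoRA, the low-rank update then typically needs to be applied only to some of these chunks (for example query and value), which is why the merged case gets its own `MergedLinear` class in Microsoft's implementation.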

Note: examples of calculating the q, k and v matrices with separate multiplications and with a single combined one can be found in `attention.py` in this repository.

[^1]: [Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA) by Lightning.ai](https://lightning.ai/pages/community/tutorial/lora-llm/)
