

LoRA (#16)
Low-Rank Adaptation implementation
Andrei-Aksionov authored May 10, 2023
1 parent f3daa68 commit fb208fb
Showing 15 changed files with 703 additions and 17 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -1,23 +1,22 @@

<p>
<h2 align="center">Welcome to NanoGPT+ in PyTorch</h2>
<h5 align="center">Knock-off edition (but with enchantments)<h5>
<h2 align="center">Welcome to NanoGPT+</h2>
<h4 align="center">Knock-off edition</h4>
<h6 align="center">but with enchantments</h6>
</p>

[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)
[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-310/)
![Python versions](/assets/readme/python_versions.svg)
[![test](https://github.com/Andrei-Aksionov/nanoGPTplus/actions/workflows/test.yaml/badge.svg)](https://github.com/Andrei-Aksionov/nanoGPTplus/actions/workflows/test.yaml)

***

In this repository I want to rewrite the code for `nanoGPT` presented by Andrej Karpathy in [this video](https://www.youtube.com/watch?v=kCc8FmEb1nY). The original code is in a state suitable for rapid prototyping, while the code in this repository is, in my opinion, more mature (with docstrings, comments explaining what exactly is going on, a readme for the architecture, ...), hence the name nanoGPT+ (you can read it as a very, very small plus :laughing:)
In this repository I want to rewrite the code for `nanoGPT` presented by Andrej Karpathy in [this video](https://www.youtube.com/watch?v=kCc8FmEb1nY). The original code is in a state suitable for rapid prototyping, while the code in this repository is, in my opinion, more mature (with docstrings, comments explaining what exactly is going on, a readme for the architecture, key-value cache and Low-Rank Adaptation (LoRA) implementations, ...), hence the name nanoGPT+ (you can read it as a very, very small plus :laughing:)

The purpose of it is to better understand how Transformer architecture works by actually writing code and, if possible, making it better (or at least to make it work with as few issues as possible).

> **Note**: while the code in this repository reflects almost all of the logic of the original one, due to a lack of access to a GPU (let alone multiple GPUs or nodes with multiple GPUs) I haven't added GPU-specific code, so if you have one (a GPU or even a whole node) you should look at the [original repo](https://github.com/karpathy/nanoGPT).
<p align=center><img src="references/readme/amazon_prime.jpg"></p>
<p align=center><img src="assets/readme/amazon_prime.jpg"></p>

# Project structure

File renamed without changes
23 changes: 23 additions & 0 deletions assets/readme/python_versions.svg
File renamed without changes
18 changes: 18 additions & 0 deletions src/config/config.yaml
@@ -60,6 +60,12 @@ model:
# lr scheduler
warmup_iters: null
lr_decay_iters: null
# PEFT
# Note: for training from scratch this should be set to false
use_lora: false
lora_rank: 2
lora_alpha: 3
lora_dropout: 0.0
grad_accumulation_steps: 5
epochs: 1
tqdm_update_interval: 1
@@ -85,6 +91,12 @@ model:
# lr scheduler
warmup_iters: null
lr_decay_iters: null
# PEFT
# Note: for training from scratch this should be set to false
use_lora: false
lora_rank: 2
lora_alpha: 3
lora_dropout: 0.0
grad_accumulation_steps: null
epochs: 1
tqdm_update_interval: 1
@@ -110,6 +122,12 @@ model:
# lr scheduler
warmup_iters: null
lr_decay_iters: null
# PEFT
# Note: for training from scratch this should be set to false
use_lora: false
lora_rank: 2
lora_alpha: 3
lora_dropout: 0.0
grad_accumulation_steps: null
epochs: 1
tqdm_update_interval: 10
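
For illustration, the new PEFT options might be consumed roughly as sketched below; the flattened dictionary stands in for one model section of `config.yaml` (the real nesting may differ), and the `alpha / rank` scaling follows the usual LoRA convention rather than code from this commit:

```python
# a flattened stand-in for one model section of config.yaml (the real nesting may differ)
model_config = {
    "use_lora": True,  # keep false when training from scratch
    "lora_rank": 2,
    "lora_alpha": 3,
    "lora_dropout": 0.0,
}

if model_config["use_lora"]:
    # the low-rank update is usually scaled by alpha / rank
    scaling = model_config["lora_alpha"] / model_config["lora_rank"]
    print(f"LoRA enabled: rank={model_config['lora_rank']}, scaling={scaling}")
```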
6 changes: 3 additions & 3 deletions src/model/gpt_language_model/README.md
@@ -1,6 +1,6 @@
# Notes about Transformer architecture

![transformer architecture](../../../references/transformer/transformer_architecture.png)
![transformer architecture](../../../assets/transformer/transformer_architecture.png)

pic 1: Transformer architecture[^1] (encoder on the left, decoder on the right).

@@ -11,7 +11,7 @@ Decoder consists of transformer blocks and each transformer block consists of two
1. Self-attention layer.
2. Feed-forward layer.

![transformer block](../../../references/transformer/transformer_block.png)
![transformer block](../../../assets/transformer/transformer_block.png)

pic 2: Transformer block[^1].
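
For orientation, here is a minimal sketch of such a block in PyTorch; it uses the built-in `nn.MultiheadAttention` and omits the causal mask for brevity, whereas this repository implements attention itself in `attention.py`:

```python
import torch
from torch import nn


class Block(nn.Module):
    """Illustrative transformer block: self-attention followed by a feed-forward layer."""

    def __init__(self, embeddings_size: int, num_heads: int) -> None:
        super().__init__()
        self.ln_1 = nn.LayerNorm(embeddings_size)
        self.attention = nn.MultiheadAttention(embeddings_size, num_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(embeddings_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embeddings_size, 4 * embeddings_size),
            nn.GELU(),
            nn.Linear(4 * embeddings_size, embeddings_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # pre-norm residual connections; causal masking is omitted here for brevity
        normed = self.ln_1(x)
        attention_out, _ = self.attention(normed, normed, normed, need_weights=False)
        x = x + attention_out
        return x + self.feed_forward(self.ln_2(x))


block = Block(embeddings_size=64, num_heads=4)
out = block(torch.randn(2, 8, 64))  # (batch size, sequence length, embeddings size)
```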

@@ -73,4 +73,4 @@ In order to build decoder one needs to have:
4. Final head fully-connected layer to transform final token embeddings into predictions.

[^1]: [Illustrated transformer](https://jalammar.github.io/illustrated-transformer/)
[^2]:[Andrej Karpaty's nanoGPT Google Colab](<https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=h5hjCcLDr2WC>)
[^2]:[Andrej Karpathy's nanoGPT Google Colab](<https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=h5hjCcLDr2WC>)
8 changes: 8 additions & 0 deletions src/model/gpt_language_model/attention.py
@@ -406,6 +406,14 @@ def forward(self, x: Tensor, kv_cache: Optional[Tensor]) -> Tensor:
# [0.1, 0.2, 0.3] -> [0.1, 0.2, 0.3]
# and after softmax `-inf` becomes 0
# this doesn't allow current token communicate with future ones

# TODO: do I correctly apply masking
# w : (batch, head, q_seq_length, kv_seq_length) # noqa: ERA001
# w = torch.matmul(q, k) # noqa: ERA001
# if self.scale:
# w = w / math.sqrt(v.size(-1)) # noqa: ERA001
# nd, ns = w.size(-2), w.size(-1)  # noqa: ERA001
# b = self.bias[:, :, ns-nd:ns, :ns]  # noqa: ERA001
attention_scores = attention_scores.masked_fill(self.tril[:T, :T] == 0, float("-inf")) # (B, nh, T, T)

# since we want to do weighted averaging we need to transform attention scores into range [0, 1]
6 changes: 5 additions & 1 deletion src/model/gpt_language_model/gpt.py
@@ -8,6 +8,7 @@
from torch import Tensor, nn
from tqdm import trange

from src.model.gpt_language_model.peft.lora import MergedLinear
from src.model.gpt_language_model.transformer_block import LayerNorm, TransformerBlock
from src.utils import log_error

@@ -118,6 +119,7 @@ def __init__(
)

def __get_parameters_number(self, exclude_positional_embeddings: bool = True) -> int:
# TODO: print total number of parameters and number of learnable parameters
"""Return total number of parameters of the model without counting parameters of positional embeddings."""
params_count = sum(param.numel() for param in self.parameters())
if exclude_positional_embeddings:
@@ -139,6 +141,8 @@ def __init_weights(self, module: torch.nn.modules) -> None:
module of the network
"""
if isinstance(module, (nn.Embedding, nn.Linear)):
# TODO: check if a different std init works better
# 0.02 / sqrt(2 * number of transformer blocks)
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if hasattr(module, "bias") and module.bias is not None:
torch.nn.init.zeros_(module.bias)
@@ -237,7 +241,7 @@ def __optimizer_parameters(self, weight_decay: float) -> Tuple[dict, dict]:
"""
# separate out all parameters to those that will and won't experience regularizing weight decay
decay, no_decay = set(), set()
expected_weight_modules = (nn.Linear, nn.LayerNorm, LayerNorm, nn.Embedding)
expected_weight_modules = (nn.Linear, nn.LayerNorm, LayerNorm, nn.Embedding, MergedLinear)
for pn, _ in self.named_parameters():
# get the parent module by the parameter's name
module = reduce(lambda module, key: getattr(module, key), pn.split(".")[:-1], self)
41 changes: 41 additions & 0 deletions src/model/gpt_language_model/peft/README.md
@@ -0,0 +1,41 @@
# Parameter-Efficient Finetuning for LLMs

## 1. Low-Rank Adaptation (LoRA)

![LoRA](https://lightningaidev.wpengine.com/wp-content/uploads/2023/04/lora-3-1024x742.png)
pic 1: LoRA architecture[^1]

Low-rank adaptation is a technique for parameter-efficient finetuning of large language models. Instead of updating all the weights during finetuning (which is very compute-intensive), we learn a separate matrix that stores the updates to the pretrained weights, and to reduce the number of trainable parameters we decompose this weight-update matrix into two matrices of a lower rank.

As an example: we can have pretrained weights of shape (d, d) and two matrices A and B of shape (d, r) and (r, d) respectively; the pretrained weights are not changed during training, only matrices A and B are.
As can be seen in the scheme above, we apply the pretrained weights to the input, apply the separate weight-update matrix formed by the A@B matrix multiplication as well, and then sum the two outputs.

If d=100 and r=1, then instead of updating 100x100=10_000 parameters (for the pretrained weights) we update only 100x1 (for matrix A) plus 1x100 (for matrix B), which makes 200 parameters in total instead of 10k.
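
To make the arithmetic above concrete, here is a minimal LoRA-style linear layer in PyTorch; it is an illustrative sketch (names and init values are assumptions), not the `MergedLinear` class used in this repository:

```python
import torch
from torch import nn


class LoRALinear(nn.Module):
    """Sketch of LoRA: y = pretrained(x) + (alpha / r) * x @ A @ B."""

    def __init__(self, d_in: int, d_out: int, rank: int, alpha: int = 3) -> None:
        super().__init__()
        self.pretrained = nn.Linear(d_in, d_out, bias=False)
        self.pretrained.weight.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(d_in, rank) * 0.01)  # (d, r)
        self.lora_B = nn.Parameter(torch.zeros(rank, d_out))  # (r, d): zeros, so the update starts at 0
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pretrained(x) + (x @ self.lora_A @ self.lora_B) * self.scaling


layer = LoRALinear(d_in=100, d_out=100, rank=1)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 200: only A (100x1) and B (1x100) are updated
```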

You can read more about it in:

1. Microsoft's LoRA
- [paper](https://arxiv.org/pdf/2106.09685.pdf)
- [repo](https://github.com/microsoft/LoRA/)
2. Lighting.ai
- [blogpost](https://lightning.ai/pages/community/tutorial/lora-llm/)
- [repo](https://github.com/Lightning-AI/lit-llama)

### -- About implementation --

If you take a look at Microsoft's LoRA repo, you may notice that there are two classes, `Linear` and `MergedLinear`, which can be confusing at first.

Basically, there are two approaches to calculating the query, key and value matrices:

In the basic implementation of the attention mechanism you have three separate weight matrices for query, key and
value, and thus in order to obtain q, k and v you apply these three matrices separately to the input x (in self-attention).
This approach is covered by the LoRA `Linear` class. It's not implemented in this repository, but you can find an implementation [here](https://github.com/microsoft/LoRA/blob/main/loralib/layers.py#L91).

The other approach is to have a single matrix that stores the weights for all three projections (query, key and value).
You can then apply this big combined matrix once (which helps with parallelization on a GPU) and split the
output into three chunks to obtain queries, keys and values. The `MergedLinear` class covers this approach.
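
A minimal sketch of this combined projection (illustrative names, not the exact code from `attention.py`):

```python
import torch
from torch import nn

batch_size, seq_len, embeddings_size = 2, 8, 64

# one weight matrix produces the query, key and value projections in a single matmul
qkv_projection = nn.Linear(embeddings_size, 3 * embeddings_size, bias=False)

x = torch.randn(batch_size, seq_len, embeddings_size)
qkv = qkv_projection(x)                       # (B, T, 3 * C)
q, k, v = qkv.split(embeddings_size, dim=-1)  # three chunks of shape (B, T, C)
```

With LoRA, the low-rank update then typically needs to be applied only to some of these chunks (for example query and value), which is why the merged case gets its own `MergedLinear` class in Microsoft's implementation.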

Note: examples of calculating the q, k and v matrices with separate multiplications and with a single combined one can be found in `attention.py` in this repository.

[^1]: [Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA) by Lightning.ai](https://lightning.ai/pages/community/tutorial/lora-llm/)
