From 2f9caccf6407eb555ad97b8781287c7e6d879f03 Mon Sep 17 00:00:00 2001
From: Michael <41590425+michaelshekasta@users.noreply.github.com>
Date: Sun, 12 Jan 2025 16:41:26 +0200
Subject: [PATCH 01/12] Create 8 - quantization

---
 8 - quantization | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 8 - quantization

diff --git a/8 - quantization b/8 - quantization
new file mode 100644
index 00000000..8b137891
--- /dev/null
+++ b/8 - quantization
@@ -0,0 +1 @@
+

From 43c62728f93c55ad160dc22b65d9852f5afb0087 Mon Sep 17 00:00:00 2001
From: Michael <41590425+michaelshekasta@users.noreply.github.com>
Date: Sun, 12 Jan 2025 17:01:17 +0200
Subject: [PATCH 02/12] Delete 8 - quantization

---
 8 - quantization | 1 -
 1 file changed, 1 deletion(-)
 delete mode 100644 8 - quantization

diff --git a/8 - quantization b/8 - quantization
deleted file mode 100644
index 8b137891..00000000
--- a/8 - quantization
+++ /dev/null
@@ -1 +0,0 @@
-

From dfd78404258682aa952cd9d031cefc9e5bb5e44f Mon Sep 17 00:00:00 2001
From: Michael <41590425+michaelshekasta@users.noreply.github.com>
Date: Sun, 12 Jan 2025 17:02:18 +0200
Subject: [PATCH 03/12] Quantization draft

---
 8 - Quantization/readme.md | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 8 - Quantization/readme.md

diff --git a/8 - Quantization/readme.md b/8 - Quantization/readme.md
new file mode 100644
index 00000000..73096d0f
--- /dev/null
+++ b/8 - Quantization/readme.md
@@ -0,0 +1,37 @@
+# Quantization
+
+This module will guide you through optimizing language models for efficient inference on CPUs, without the need for heavy GPUs.
+We’ll cover quantization, a technique that reduces model size and improves inference speed, and introduce GGUF (a format for optimized models).
+Additionally, we’ll explore how to perform inference on Intel and MLX (machine learning accelerators) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment.
+
+## Quantization
+
+TBD
+Motivation? less memory less accuracy? comparing the results? Int4, Int8, bf16?
+
+## GGUF format
+
+TBD
+using huggingface to run diff quantization models
+ollama and llm.cpp?
+
+## CPU Inference (Intel & MLX)
+
+TBD
+use mlx for inference
+use intel for inference (ipex? openvino?)
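+
+As a rough preview of the idea (only a sketch, not a final exercise), local inference on an Apple-silicon machine could look like the snippet below using the `mlx-lm` package; the repo id is an assumed example of a pre-converted 4-bit MLX checkpoint, and on Intel CPUs an analogous path exists via IPEX or OpenVINO.
+
+```python
+# pip install mlx-lm   (Apple silicon / macOS only)
+from mlx_lm import load, generate
+
+# Assumed example repo id for a small instruction-tuned model converted to MLX 4-bit.
+model, tokenizer = load("mlx-community/SmolLM2-135M-Instruct-4bit")
+
+# Run a short generation entirely on the local machine, no GPU required.
+print(generate(model, tokenizer, prompt="What does quantization do?", max_tokens=64))
+```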
+
## Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Quantization | Description| Exercise| [link](./notebooks/example.ipynb) | Open In Colab |
| GGUF format | Description| Exercise| [link](./notebooks/example.ipynb) | Open In Colab |
| CPU Inference (Intel & MLX) | Description| Exercise| [link](./notebooks/example.ipynb) | Open In Colab |

## References

- [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
- [GGUF Docs](https://huggingface.co/docs/hub/gguf)
- [Mlx Docs](https://huggingface.co/docs/hub/mlx)
- [Intel IPEX](https://huggingface.co/docs/accelerate/usage_guides/ipex)

From 7a84cc1e2963b8d92324a3107dcd6801f03d89cf Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Thu, 16 Jan 2025 10:42:38 +0100
Subject: [PATCH 04/12] make directory naming consistent

---
 {8 - Quantization => 8_Quantization}/readme.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename {8 - Quantization => 8_Quantization}/readme.md (100%)

diff --git a/8 - Quantization/readme.md b/8_Quantization/readme.md
similarity index 100%
rename from 8 - Quantization/readme.md
rename to 8_Quantization/readme.md

From 00c027f76d1d12d7503dd7a27b1df40a67eb1302 Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Thu, 16 Jan 2025 10:44:38 +0100
Subject: [PATCH 05/12] update structure in readme

---
 .../notebooks/chat_templates_example.ipynb | 2 +-
 8_Quantization/readme.md | 24 +++++++------------
 2 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/1_instruction_tuning/notebooks/chat_templates_example.ipynb b/1_instruction_tuning/notebooks/chat_templates_example.ipynb
index 93772206..ea07a5b0 100644
--- a/1_instruction_tuning/notebooks/chat_templates_example.ipynb
+++ b/1_instruction_tuning/notebooks/chat_templates_example.ipynb
@@ -584,7 +584,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
- "version": "3.10.15"
+ "version": "3.11.10"
 },
 "widgets": {
 "application/vnd.jupyter.widget-state+json": {

diff --git a/8_Quantization/readme.md b/8_Quantization/readme.md
index 73096d0f..703c6d7f 100644
--- a/8_Quantization/readme.md
+++ b/8_Quantization/readme.md
@@ -1,33 +1,25 @@
 # Quantization
-This module will guide you through optimizing language models for efficient inference on CPUs, without the need for heavy GPUs.
-We’ll cover quantization, a technique that reduces model size and improves inference speed, and introduce GGUF (a format for optimized models).
-Additionally, we’ll explore how to perform inference on Intel and MLX (machine learning accelerators) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment.
+This module will guide you through the concenpt of quantization which is useful for optimizing language models for efficient inference on CPUs, without the need for GPUs. We’ll focus on quantization for inference, a technique that reduces model size to improve inference speed. Additionally, we’ll explore how to perform inference on Intel and MLX (machine learning accelerators) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment.
-## Quantization
+## Quantization Fundementals
-TBD
-Motivation? less memory less accuracy? comparing the results? Int4, Int8, bf16?
+First we will introduce quantization and explain how it reduces model size.
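+
+As a back-of-the-envelope sketch of why this matters (assuming a 1.7B-parameter model purely as an example), weight memory is roughly the parameter count times the bytes per value, so int8 is a ~4x saving over float32 and int4 a ~8x saving:
+
+```python
+params = 1_700_000_000  # assumed example: a 1.7B-parameter model
+bytes_per_value = {"float32": 4, "bfloat16": 2, "int8": 1, "int4": 0.5}
+
+for dtype, nbytes in bytes_per_value.items():
+    gib = params * nbytes / 1024**3  # weights only; ignores activations and overhead
+    print(f"{dtype:>8}: ~{gib:.1f} GiB")
+```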
-## GGUF format
+## The GGUF format
-TBD
-using huggingface to run diff quantization models
-ollama and llm.cpp?
+Second, we will introduce the GGUF format and the LlamaCPP package. We will explain how to quantize pre-trained or finetuned models, and how to use them for optimized inference with LlamaCPP.

## CPU Inference (Intel & MLX)
-TBD
-use mlx for inference
-use intel for inference (ipex? openvino?)
+Finally, we will explore how to perform inference on Intel and MLX (machine learning accelerators for MacOS) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment.

## Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
-| Quantization | Description| Exercise| [link](./notebooks/example.ipynb) | Open In Colab |
-| GGUF format | Description| Exercise| [link](./notebooks/example.ipynb) | Open In Colab |
-| CPU Inference (Intel & MLX) | Description| Exercise| [link](./notebooks/example.ipynb) | Open In Colab |
+| Quantization with LlamaCPP | Description| Exercise| [link](./notebooks/example.ipynb) | Open In Colab |
+| CPU Inference (Intel or MLX) | Description| Exercise| [link](./notebooks/example.ipynb) | Open In Colab |

## References

From 59f3045d3ae74d6f236d66970cac8e11aa7fa144 Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Thu, 16 Jan 2025 11:18:06 +0100
Subject: [PATCH 06/12] update readme with structure

---
 8_Quantization/readme.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/8_Quantization/readme.md b/8_Quantization/readme.md
index 703c6d7f..30e03eaa 100644
--- a/8_Quantization/readme.md
+++ b/8_Quantization/readme.md
@@ -1,18 +1,18 @@
 # Quantization
-This module will guide you through the concenpt of quantization which is useful for optimizing language models for efficient inference on CPUs, without the need for GPUs. We’ll focus on quantization for inference, a technique that reduces model size to improve inference speed. Additionally, we’ll explore how to perform inference on Intel and MLX (machine learning accelerators) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment.
+This module will guide you through the concept of quantization which is useful for optimizing language models for efficient inference on CPUs, without the need for GPUs. We’ll focus on quantization for inference, a technique that reduces model size to improve inference speed. Additionally, we’ll explore how to perform inference on Intel and MLX (machine learning accelerators) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment.
-## Quantization Fundementals
+## Quantization Fundamentals
-First we will introduce quantization and explain how it reduces model size.
+First we will introduce quantization and explain how it reduces model size. Check out the [Fundamentals](./fundamentals.md) page for more information.

 ## The GGUF format
-Second, we will introduce the GGUF format and the LlamaCPP package. We will explain how to quantize pre-trained or finetuned models, and how to use them for optimized inference with LlamaCPP.
+Second, we will introduce the GGUF format and the LlamaCPP package. We will explain how to quantize pre-trained or finetuned models, and how to use them for optimized inference with LlamaCPP. Check out the [GGUF](./gguf.md) page for more information.
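+
+As a preview of what the GGUF exercise will build towards (a minimal sketch only; the repo and file names are assumed examples of a published GGUF export), loading and running a quantized GGUF file with the `llama-cpp-python` bindings looks roughly like this:
+
+```python
+# pip install llama-cpp-python huggingface_hub
+from llama_cpp import Llama
+
+# Download an assumed example 4-bit GGUF file from the Hub and load it on CPU.
+llm = Llama.from_pretrained(
+    repo_id="bartowski/SmolLM2-1.7B-Instruct-GGUF",  # assumed example repo
+    filename="*Q4_K_M.gguf",                         # 4-bit K-quant variant
+    n_ctx=2048,
+)
+
+out = llm("Explain the GGUF format in one sentence.", max_tokens=64)
+print(out["choices"][0]["text"])
+```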
## CPU Inference (Intel & MLX)
-Finally, we will explore how to perform inference on Intel and MLX (machine learning accelerators for MacOS) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment.
+Finally, we will explore how to perform inference on Intel and MLX (machine learning accelerators for MacOS) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment. Check out the [CPU Inference](./cpu.md) page for more information.

## Exercise Notebooks

From 12fc3fc6535504801cefcda2e2e2486fac2c9e9b Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Thu, 16 Jan 2025 11:18:16 +0100
Subject: [PATCH 07/12] update sub pages with structure

---
 8_Quantization/cpu.md | 9 +++++++++
 8_Quantization/fundamentals.md | 11 +++++++++++
 8_Quantization/gguf.md | 11 +++++++++++
 3 files changed, 31 insertions(+)
 create mode 100644 8_Quantization/cpu.md
 create mode 100644 8_Quantization/fundamentals.md
 create mode 100644 8_Quantization/gguf.md

diff --git a/8_Quantization/cpu.md b/8_Quantization/cpu.md
new file mode 100644
index 00000000..aed0ad6c
--- /dev/null
+++ b/8_Quantization/cpu.md
@@ -0,0 +1,9 @@
+# Inference on CPUs
+
+## Intel CPUs
+
+## MLX CPUs
+
+## Exercise Notebooks
+
+## References

diff --git a/8_Quantization/fundamentals.md b/8_Quantization/fundamentals.md
new file mode 100644
index 00000000..70b33515
--- /dev/null
+++ b/8_Quantization/fundamentals.md
@@ -0,0 +1,11 @@
+# Quantization Fundamentals
+
+## What is Quantization?
+
+## Quantization Techniques
+
+## Quantization for Inference
+
+## Exercise Notebooks
+
+## References

diff --git a/8_Quantization/gguf.md b/8_Quantization/gguf.md
new file mode 100644
index 00000000..95a5a960
--- /dev/null
+++ b/8_Quantization/gguf.md
@@ -0,0 +1,11 @@
+# The GGUF format
+
+## LlamaCPP
+
+## Introduction to GGUF
+
+## Quantizing Models with GGUF
+
+## Exercise Notebooks
+
+## References

From 7951134a94c37cc581a21def3db94decd13b1f7a Mon Sep 17 00:00:00 2001
From: Michael <41590425+michaelshekasta@users.noreply.github.com>
Date: Thu, 16 Jan 2025 14:29:50 +0200
Subject: [PATCH 08/12] Update fundamentals.md

---
 8_Quantization/fundamentals.md | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/8_Quantization/fundamentals.md b/8_Quantization/fundamentals.md
index 70b33515..f0279944 100644
--- a/8_Quantization/fundamentals.md
+++ b/8_Quantization/fundamentals.md
@@ -1,11 +1,20 @@
 # Quantization Fundamentals
 ## What is Quantization?
+Quantization is a technique used to reduce memory and computational costs by representing model weights and activations with lower-precision data types, such as 8-bit integers (int8). By doing so, it allows larger models to fit into memory and speeds up inference, making the model more efficient without significantly sacrificing performance.
 ## Quantization Techniques
+We should focus on GPTQ?
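+
+As a minimal sketch of the basic idea before any specific method (plain absmax rounding of a weight matrix to int8, nothing framework-specific), the quantize/dequantize round trip looks like this:
+
+```python
+import numpy as np
+
+w = np.random.randn(4, 4).astype(np.float32)   # toy "weight matrix"
+
+# Absmax quantization: map [-max|w|, +max|w|] onto the int8 range [-127, 127].
+scale = 127 / np.max(np.abs(w))
+w_int8 = np.round(w * scale).astype(np.int8)   # stored weights, 1 byte each
+
+# Dequantize at inference time and check the rounding error that was introduced.
+w_dequant = w_int8.astype(np.float32) / scale
+print("max abs error:", np.max(np.abs(w - w_dequant)))
+```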
## Quantization for Inference
+
## Exercise Notebooks

## References
+https://huggingface.co/docs/transformers/main_classes/quantization
+https://huggingface.co/docs/transformers/v4.48.0/quantization/overview
+https://huggingface.co/docs/optimum/en/concept_guides/quantization
+https://huggingface.co/blog/introduction-to-ggml
+https://huggingface.co/docs/hub/gguf
+https://huggingface.co/docs/transformers/gguf

From d9723eafbb3c5519e50007967b80886c6a94e825 Mon Sep 17 00:00:00 2001
From: Michael <41590425+michaelshekasta@users.noreply.github.com>
Date: Wed, 22 Jan 2025 00:59:21 +0200
Subject: [PATCH 09/12] Update fundamentals.md

---
 8_Quantization/fundamentals.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/8_Quantization/fundamentals.md b/8_Quantization/fundamentals.md
index f0279944..9e362e9d 100644
--- a/8_Quantization/fundamentals.md
+++ b/8_Quantization/fundamentals.md
@@ -3,13 +3,22 @@
 ## What is Quantization?
 Quantization is a technique used to reduce memory and computational costs by representing model weights and activations with lower-precision data types, such as 8-bit integers (int8). By doing so, it allows larger models to fit into memory and speeds up inference, making the model more efficient without significantly sacrificing performance.
+* motivation - already wrote
+* Floating Point Representation dtypes - float32, float16, bfloat16, int8, int4
+* absmax & zero-point quantization
+* handling outliers with float16
+
 ## Quantization Techniques
-We should focus on GPTQ?
+* Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT)
 ## Quantization for Inference
-
+Need to look for more resources
 ## Exercise Notebooks
+I'm unsure about what exactly we should include here. Below are a few options, along with my humble thoughts:
+* Type casting (float32 to int8): This seems too low-level.
+* Reproducing a GPT-2 example from the Maxim blog post: I'm uncertain about the contribution this would make.
+* Taking a large model from the Hugging Face Hub and converting it to a quantized model: This might fit better in the section where we discuss GPTQ or other quantization methods.
 ## References

From b6793faf71e27b01df17cb26e582497a2b3b362a Mon Sep 17 00:00:00 2001
From: Michael <41590425+michaelshekasta@users.noreply.github.com>
Date: Wed, 22 Jan 2025 00:59:56 +0200
Subject: [PATCH 10/12] Update fundamentals.md typo

---
 8_Quantization/fundamentals.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/8_Quantization/fundamentals.md b/8_Quantization/fundamentals.md
index 9e362e9d..79275773 100644
--- a/8_Quantization/fundamentals.md
+++ b/8_Quantization/fundamentals.md
@@ -17,7 +17,7 @@ Need to look for more resources
 ## Exercise Notebooks
 I'm unsure about what exactly we should include here. Below are a few options, along with my humble thoughts:
 * Type casting (float32 to int8): This seems too low-level.
-* Reproducing a GPT-2 example from the Maxim blog post: I'm uncertain about the contribution this would make.
+* Reproducing a GPT-2 example from the Maxime blog post: I'm uncertain about the contribution this would make.
 * Taking a large model from the Hugging Face Hub and converting it to a quantized model: This might fit better in the section where we discuss GPTQ or other quantization methods.
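+
+A possible starting point for that last option (only a sketch: it uses PyTorch dynamic post-training quantization rather than GPTQ, and the model id is just an assumed example) could be:
+
+```python
+import os
+import torch
+from transformers import AutoModelForCausalLM
+
+# Assumed example checkpoint; any small causal LM from the Hub would do.
+model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
+model.eval()
+
+# Dynamic PTQ: nn.Linear weights are stored as int8, activations are quantized on the fly (CPU only).
+quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
+
+# Compare serialized sizes to see the reduction from quantizing the linear layers.
+torch.save(model.state_dict(), "model_fp32.pt")
+torch.save(quantized.state_dict(), "model_int8.pt")
+for f in ("model_fp32.pt", "model_int8.pt"):
+    print(f, f"{os.path.getsize(f) / 1e6:.0f} MB")
+```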
 ## References

From 1b94c08a86ecc0a1b57ec4a155247067fa54e89f Mon Sep 17 00:00:00 2001
From: Michael <41590425+michaelshekasta@users.noreply.github.com>
Date: Wed, 22 Jan 2025 01:01:00 +0200
Subject: [PATCH 11/12] Update fundamentals.md

---
 8_Quantization/fundamentals.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/8_Quantization/fundamentals.md b/8_Quantization/fundamentals.md
index 79275773..ea23fba6 100644
--- a/8_Quantization/fundamentals.md
+++ b/8_Quantization/fundamentals.md
@@ -9,7 +9,7 @@ Quantization is a technique used to reduce memory and computational costs by rep
 * handling outliers with float16
 ## Quantization Techniques
-* Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT)
+* Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT) - focus on PTQ
 ## Quantization for Inference
 Need to look for more resources

From 021bbf07cd5cb3e9958456c6009bf927ffc3c154 Mon Sep 17 00:00:00 2001
From: Michael <41590425+michaelshekasta@users.noreply.github.com>
Date: Wed, 22 Jan 2025 01:02:18 +0200
Subject: [PATCH 12/12] Update fundamentals.md

---
 8_Quantization/fundamentals.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/8_Quantization/fundamentals.md b/8_Quantization/fundamentals.md
index ea23fba6..3a879c88 100644
--- a/8_Quantization/fundamentals.md
+++ b/8_Quantization/fundamentals.md
@@ -20,6 +20,9 @@ I'm unsure about what exactly we should include here, a
 * Reproducing a GPT-2 example from the Maxime blog post: I'm uncertain about the contribution this would make.
 * Taking a large model from the Hugging Face Hub and converting it to a quantized model: This might fit better in the section where we discuss GPTQ or other quantization methods.
+## Open Questions
+Where should we talk about "quantization methods" like GPTQ?
 ## References
 https://huggingface.co/docs/transformers/main_classes/quantization
 https://huggingface.co/docs/transformers/v4.48.0/quantization/overview