[MODULE] A module on quantization #169

Draft
wants to merge 13 commits into base: main
@@ -584,7 +584,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
"version": "3.11.10"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
9 changes: 9 additions & 0 deletions 8_Quantization/cpu.md
@@ -0,0 +1,9 @@
# Inference on CPUs

## Intel CPUs

## Apple Silicon (MLX)

## Exercise Notebooks

## References
32 changes: 32 additions & 0 deletions 8_Quantization/fundamentals.md
@@ -0,0 +1,32 @@
# Quantization Fundamentals

## What is Quantization?
Quantization is a technique used to reduce memory and computational costs by representing model weights and activations with lower-precision data types, such as 8-bit integers (int8). By doing so, it allows larger models to fit into memory and speeds up inference, making the model more efficient without significantly sacrificing performance.
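
As a quick back-of-envelope illustration (my own example, not from the module): a 7-billion-parameter model needs roughly 28 GB just for its weights in float32, but only about 7 GB in int8.

```python
# Approximate weight-only memory footprint of a hypothetical 7B-parameter model.
params = 7e9
for dtype, bytes_per_param in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype:>18}: ~{params * bytes_per_param / 1e9:.1f} GB of weights")
```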

* Motivation (drafted above)
* Numeric representations and dtypes: float32, float16, bfloat16, int8, int4
* absmax & zero-point quantization (see the sketch below)
* Handling outliers with float16
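
To make the absmax and zero-point items above concrete, here is a minimal NumPy sketch of both int8 schemes (the function names and toy tensor are illustrative choices of mine, not the module's code):

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Symmetric int8 quantization: scale by the largest absolute value."""
    scale = 127 / np.max(np.abs(x))
    x_q = np.round(x * scale).astype(np.int8)
    return x_q, scale

def zeropoint_quantize(x: np.ndarray):
    """Asymmetric int8 quantization: map [min, max] onto [-128, 127]."""
    x_range = max(x.max() - x.min(), 1e-8)  # avoid division by zero
    scale = 255 / x_range
    zero_point = np.round(-scale * x.min() - 128)
    x_q = np.clip(np.round(x * scale + zero_point), -128, 127).astype(np.int8)
    return x_q, scale, zero_point

weights = np.random.randn(4, 4).astype(np.float32)
q, s = absmax_quantize(weights)
print("dequantized (absmax):\n", q / s)  # close to the original weights
```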

## Quantization Techniques
* Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT); the module will focus on PTQ (see the sketch below)
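
As one simple, concrete flavor of PTQ, here is a hedged sketch using PyTorch dynamic quantization on a toy model; this is just one possible PTQ recipe, not necessarily the method the module will teach:

```python
import torch
import torch.nn as nn

# Toy FP32 model standing in for a pre-trained network (purely illustrative).
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64)).eval()

# Post-training dynamic quantization: nn.Linear weights are stored as int8
# and dequantized on the fly at inference time. No retraining is required.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)  # same output shape, smaller weight storage
```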

## Quantization for Inference
Need to look for more resources

## Exercise Notebooks
I'm unsure about what exactly we should include here. Below are a few options, along with my humble thoughts:
Collaborator commented:
If you look at the other modules you'll see a table with example notebooks. In this module we will need two: one on GGUF and one on CPU inference.

* Type casting (float32 to int8): This seems too low-level.
* Reproducing a GPT-2 example from the Maxime blog post: I'm uncertain about the contribution this would make.
* Taking a large model from the Hugging Face Hub and converting it to a quantized model: This might fit better in the section where we discuss GPTQ or other quantization methods.

## Open Questions
Where should we talk about quantization methods like GPTQ?

## References
- [Transformers docs: Quantization main classes](https://huggingface.co/docs/transformers/main_classes/quantization)
- [Transformers docs: Quantization overview](https://huggingface.co/docs/transformers/v4.48.0/quantization/overview)
- [Optimum docs: Quantization concept guide](https://huggingface.co/docs/optimum/en/concept_guides/quantization)
- [Introduction to ggml (Hugging Face blog)](https://huggingface.co/blog/introduction-to-ggml)
- [Hub docs: GGUF](https://huggingface.co/docs/hub/gguf)
- [Transformers docs: GGUF](https://huggingface.co/docs/transformers/gguf)
11 changes: 11 additions & 0 deletions 8_Quantization/gguf.md
@@ -0,0 +1,11 @@
# The GGUF format

## LlamaCPP

## Introduction to GGUF

## Quantizing Models with GGUF

## Exercise Notebooks

## References
29 changes: 29 additions & 0 deletions 8_Quantization/readme.md
@@ -0,0 +1,29 @@
# Quantization

This module will guide you through the concept of quantization, a technique for optimizing language models so they can run efficiently on CPUs, without the need for GPUs. We’ll focus on quantization for inference, which reduces model size to improve inference speed. We’ll also explore how to perform inference on Intel CPUs and on Apple silicon with MLX, demonstrating how to leverage local hardware for efficient and cost-effective model deployment.

## Quantization Fundamentals

First, we will introduce quantization and explain how it reduces model size. Check out the [Fundamentals](./fundamentals.md) page for more information.

## The GGUF format

Second, we will introduce the GGUF format and the LlamaCPP package. We will explain how to quantize pre-trained or fine-tuned models and how to use them for optimized inference with LlamaCPP. Check out the [GGUF](./gguf.md) page for more information.
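
As a preview of that workflow, here is a hedged sketch of running a quantized GGUF model with the llama-cpp-python bindings; the repo id and filename are placeholders, not the models the exercises will use:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Pull a quantized GGUF file from the Hugging Face Hub (placeholder names).
llm = Llama.from_pretrained(
    repo_id="some-org/some-model-GGUF",
    filename="*Q4_K_M.gguf",  # pick a 4-bit quantized variant
    n_ctx=2048,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}]
)
print(output["choices"][0]["message"]["content"])
```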

## CPU Inference (Intel & MLX)

Finally, we will explore how to perform inference on Intel CPUs and on Apple silicon using MLX (Apple's machine learning framework for macOS), demonstrating how to run models locally for efficient and cost-effective deployment. Check out the [CPU Inference](./cpu.md) page for more information.
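
As a preview of the Intel path, here is a hedged sketch of CPU inference with a Transformers model optimized through Intel Extension for PyTorch (IPEX); the model id is only an example and this is just one of the approaches the page may cover:

```python
import torch
import intel_extension_for_pytorch as ipex  # pip install intel_extension_for_pytorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # example; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Apply IPEX operator and graph optimizations for Intel CPUs.
model = ipex.optimize(model)

inputs = tokenizer("Quantization lets models run on", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```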

## Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Quantization with LlamaCPP | Description| Exercise| [link](./notebooks/example.ipynb) | <a target="_blank" href="link"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| CPU Inference (Intel or MLX) | Description| Exercise| [link](./notebooks/example.ipynb) | <a target="_blank" href="link"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
Collaborator commented:
This table is sufficient. You can remove the mention of exercise notebooks in the sub-pages and replace it with links to this table.


## References

- [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
- [GGUF Docs](https://huggingface.co/docs/hub/gguf)
- [MLX Docs](https://huggingface.co/docs/hub/mlx)
- [Intel IPEX](https://huggingface.co/docs/accelerate/usage_guides/ipex)