[MODULE] A module on quantization #169

Draft
wants to merge 13 commits into base: main
@@ -584,7 +584,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
"version": "3.11.10"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
9 changes: 9 additions & 0 deletions 8_Quantization/cpu.md
@@ -0,0 +1,9 @@
# Inference on CPUs

## Intel CPUs

## Apple Silicon (MLX)

## Exercise Notebooks

## References
32 changes: 32 additions & 0 deletions 8_Quantization/fundamentals.md
@@ -0,0 +1,32 @@
# Quantization Fundamentals

## What is Quantization?
Quantization is a technique used to reduce memory and computational costs by representing model weights and activations with lower-precision data types, such as 8-bit integers (int8). By doing so, it allows larger models to fit into memory and speeds up inference, making the model more efficient without significantly sacrificing performance.
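
As a quick back-of-envelope illustration (my own example, not from the module): a 7-billion-parameter model needs roughly 28 GB just for its weights in float32, but only about 7 GB in int8.

```python
# Approximate weight-only memory footprint of a hypothetical 7B-parameter model.
params = 7e9
for dtype, bytes_per_param in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{dtype:>18}: ~{params * bytes_per_param / 1e9:.1f} GB of weights")
```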

* Motivation (drafted above)
* Numeric representations and dtypes: float32, float16, bfloat16, int8, int4
* absmax & zero-point quantization (see the sketch below)
* Handling outliers with float16
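
To make the absmax and zero-point items above concrete, here is a minimal NumPy sketch of both int8 schemes (the function names and toy tensor are illustrative choices of mine, not the module's code):

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Symmetric int8 quantization: scale by the largest absolute value."""
    scale = 127 / np.max(np.abs(x))
    x_q = np.round(x * scale).astype(np.int8)
    return x_q, scale

def zeropoint_quantize(x: np.ndarray):
    """Asymmetric int8 quantization: map [min, max] onto [-128, 127]."""
    x_range = max(x.max() - x.min(), 1e-8)  # avoid division by zero
    scale = 255 / x_range
    zero_point = np.round(-scale * x.min() - 128)
    x_q = np.clip(np.round(x * scale + zero_point), -128, 127).astype(np.int8)
    return x_q, scale, zero_point

weights = np.random.randn(4, 4).astype(np.float32)
q, s = absmax_quantize(weights)
print("dequantized (absmax):\n", q / s)  # close to the original weights
```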

## Quantization Techniques
* Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT); the module will focus on PTQ (see the sketch below)
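
As one simple, concrete flavor of PTQ, here is a hedged sketch using PyTorch dynamic quantization on a toy model; this is just one possible PTQ recipe, not necessarily the method the module will teach:

```python
import torch
import torch.nn as nn

# Toy FP32 model standing in for a pre-trained network (purely illustrative).
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64)).eval()

# Post-training dynamic quantization: nn.Linear weights are stored as int8
# and dequantized on the fly at inference time. No retraining is required.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)  # same output shape, smaller weight storage
```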

## Quantization for Inference
Need to look for more resources

## Exercise Notebooks
I'm unsure about what exactly we should include here. Below are a few options, along with my humble thoughts:
Collaborator commented:
If you look at the other modules you'll see a table with example notebooks. In this module we will need two: one on GGUF and one on CPU inference.

* Type casting (float32 to int8): This seems too low-level.
* Reproducing a GPT-2 example from the Maxime blog post: I'm uncertain about the contribution this would make.
* Taking a large model from the Hugging Face Hub and converting it to a quantized model: This might fit better in the section where we discuss GPTQ or other quantization methods.

## Open Questions
Where should we talk about quantization methods like GPTQ?

## References
- [Transformers docs: Quantization main classes](https://huggingface.co/docs/transformers/main_classes/quantization)
- [Transformers docs: Quantization overview](https://huggingface.co/docs/transformers/v4.48.0/quantization/overview)
- [Optimum docs: Quantization concept guide](https://huggingface.co/docs/optimum/en/concept_guides/quantization)
- [Introduction to ggml (Hugging Face blog)](https://huggingface.co/blog/introduction-to-ggml)
- [Hub docs: GGUF](https://huggingface.co/docs/hub/gguf)
- [Transformers docs: GGUF](https://huggingface.co/docs/transformers/gguf)
11 changes: 11 additions & 0 deletions 8_Quantization/gguf.md
@@ -0,0 +1,11 @@
# The GGUF format

## LlamaCPP

## Introduction to GGUF

## Quantizing Models with GGUF

## Exercise Notebooks

## References
29 changes: 29 additions & 0 deletions 8_Quantization/readme.md
@@ -0,0 +1,29 @@
# Quantization

This module will guide you through the concept of quantization, a technique for optimizing language models so they can run efficiently on CPUs, without the need for GPUs. We’ll focus on quantization for inference, which reduces model size to improve inference speed. We’ll also explore how to perform inference on Intel CPUs and on Apple silicon with MLX, demonstrating how to leverage local hardware for efficient and cost-effective model deployment.

## Quantization Fundamentals

First, we will introduce quantization and explain how it reduces model size. Check out the [Fundamentals](./fundamentals.md) page for more information.

## The GGUF format

Second, we will introduce the GGUF format and the LlamaCPP package. We will explain how to quantize pre-trained or fine-tuned models and how to use them for optimized inference with LlamaCPP. Check out the [GGUF](./gguf.md) page for more information.
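
As a preview of that workflow, here is a hedged sketch of running a quantized GGUF model with the llama-cpp-python bindings; the repo id and filename are placeholders, not the models the exercises will use:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Pull a quantized GGUF file from the Hugging Face Hub (placeholder names).
llm = Llama.from_pretrained(
    repo_id="some-org/some-model-GGUF",
    filename="*Q4_K_M.gguf",  # pick a 4-bit quantized variant
    n_ctx=2048,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}]
)
print(output["choices"][0]["message"]["content"])
```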

## CPU Inference (Intel & MLX)

Finally, we will explore how to perform inference on Intel CPUs and on Apple silicon using MLX (Apple's machine learning framework for macOS), demonstrating how to run models locally for efficient and cost-effective deployment. Check out the [CPU Inference](./cpu.md) page for more information.
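
As a preview of the Intel path, here is a hedged sketch of CPU inference with a Transformers model optimized through Intel Extension for PyTorch (IPEX); the model id is only an example and this is just one of the approaches the page may cover:

```python
import torch
import intel_extension_for_pytorch as ipex  # pip install intel_extension_for_pytorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # example; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Apply IPEX operator and graph optimizations for Intel CPUs.
model = ipex.optimize(model)

inputs = tokenizer("Quantization lets models run on", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```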

## Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Quantization with LlamaCPP | Description| Exercise| [link](./notebooks/example.ipynb) | <a target="_blank" href="link"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| CPU Inference (Intel or MLX) | Description| Exercise| [link](./notebooks/example.ipynb) | <a target="_blank" href="link"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
Collaborator commented:
This table is sufficient. You can remove the mention of exercise notebooks in the sub-pages and replace it with links to this table.


## References

- [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
- [GGUF Docs](https://huggingface.co/docs/hub/gguf)
- [MLX Docs](https://huggingface.co/docs/hub/mlx)
- [Intel IPEX](https://huggingface.co/docs/accelerate/usage_guides/ipex)