[MODULE] A module on quantization #169
Draft
michaelshekasta wants to merge 13 commits into huggingface:main from michaelshekasta:main
Commits (13)
2f9cacc Create 8 - quantization (michaelshekasta)
43c6272 Delete 8 - quantization (michaelshekasta)
dfd7840 Quantization draft (michaelshekasta)
7a84cc1 make directory naming consistent (burtenshaw)
00c027f update structure in readme (burtenshaw)
59f3045 update readme with structure (burtenshaw)
12fc3fc update sub pages with structure (burtenshaw)
b2a75f2 Merge branch 'huggingface:main' into main (michaelshekasta)
7951134 Update fundamentals.md (michaelshekasta)
d9723ea Update fundamentals.md (michaelshekasta)
b6793fa Update fundamentals.md (michaelshekasta)
1b94c08 Update fundamentals.md (michaelshekasta)
021bbf0 Update fundamentals.md (michaelshekasta)
@@ -0,0 +1,9 @@
# Inference on CPUs

## Intel CPUs

## MLX CPUs

## Exercise Notebooks

## References
@@ -0,0 +1,32 @@
# Quantization Fundamentals

## What is Quantization?
Quantization is a technique that reduces memory and computational costs by representing model weights and activations with lower-precision data types, such as 8-bit integers (int8). It allows larger models to fit into memory and speeds up inference, making models more efficient without significantly sacrificing performance.

* Motivation (already written above)
* Floating-point and integer dtypes: float32, float16, bfloat16, int8, int4
* Absmax and zero-point quantization (see the sketch below)
* Handling outliers with float16
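To make the absmax and zero-point ideas concrete, here is a minimal PyTorch sketch (the tensor and values are illustrative, not part of the module material yet):

```python
import torch

def absmax_quantize(x: torch.Tensor):
    # Scale so the largest absolute value maps to 127 (the int8 limit).
    scale = 127 / torch.max(torch.abs(x))
    x_quant = (scale * x).round().to(torch.int8)
    # Dequantize to inspect the rounding error.
    x_dequant = x_quant.to(torch.float32) / scale
    return x_quant, x_dequant

def zeropoint_quantize(x: torch.Tensor):
    # Map the full [min, max] range onto the 256 available int8 values.
    x_range = torch.max(x) - torch.min(x)
    x_range = x_range if x_range != 0 else torch.tensor(1.0)
    scale = 255 / x_range
    zeropoint = (-scale * torch.min(x) - 128).round()
    x_quant = torch.clip((x * scale + zeropoint).round(), -128, 127).to(torch.int8)
    x_dequant = (x_quant.to(torch.float32) - zeropoint) / scale
    return x_quant, x_dequant

weights = torch.randn(4, 4)  # stand-in for a weight matrix
_, deq_abs = absmax_quantize(weights)
_, deq_zp = zeropoint_quantize(weights)
print("absmax mean error:", (weights - deq_abs).abs().mean().item())
print("zero-point mean error:", (weights - deq_zp).abs().mean().item())
```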
## Quantization Techniques
* Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT); focus on PTQ (see the sketch below)
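As one possible illustration of PTQ in the Hugging Face stack (a sketch only; the model id is a placeholder and bitsandbytes plus accelerate must be installed), a full-precision checkpoint can be loaded in int8 like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "HuggingFaceTB/SmolLM2-135M"  # placeholder model id for illustration

# Post-training quantization: the checkpoint was trained in full precision and is
# only converted to int8 at load time, with no extra training required.
# Note: bitsandbytes int8 kernels typically target CUDA GPUs; CPU-friendly paths
# (GGUF / llama.cpp) are covered later in this module.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Quantization lets us", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```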
## Quantization for Inference
Need to look for more resources.
## Exercise Notebooks
I'm unsure about what exactly we should include here. Below are a few options, along with my humble thoughts:
* Type casting (float32 to int8): this seems too low-level.
* Reproducing a GPT-2 example from the Maxime blog post: I'm uncertain about the contribution this would make.
* Taking a large model from the Hugging Face Hub and converting it to a quantized model: this might fit better in the section where we discuss GPTQ or other quantization methods.

## Open Questions
Where should we discuss quantization methods like GPTQ?
## References
- https://huggingface.co/docs/transformers/main_classes/quantization
- https://huggingface.co/docs/transformers/v4.48.0/quantization/overview
- https://huggingface.co/docs/optimum/en/concept_guides/quantization
- https://huggingface.co/blog/introduction-to-ggml
- https://huggingface.co/docs/hub/gguf
- https://huggingface.co/docs/transformers/gguf
@@ -0,0 +1,11 @@
# The GGUF format

## LlamaCPP

## Introduction to GGUF

## Quantizing Models with GGUF

## Exercise Notebooks

## References
@@ -0,0 +1,29 @@
# Quantization

This module will guide you through the concept of quantization, which is useful for optimizing language models for efficient inference on CPUs, without the need for GPUs. We’ll focus on quantization for inference, a technique that reduces model size to improve inference speed. Additionally, we’ll explore how to perform inference on Intel CPUs and on Apple silicon with MLX, demonstrating how to leverage local resources for efficient and cost-effective model deployment.

## Quantization Fundamentals

First, we will introduce quantization and explain how it reduces model size. Check out the [Fundamentals](./fundamentals.md) page for more information.
## The GGUF format

Second, we will introduce the GGUF format and the LlamaCPP package. We will explain how to quantize pre-trained or finetuned models, and how to use them for optimized inference with LlamaCPP. Check out the [GGUF](./gguf.md) page for more information.
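As a taste of what this will look like in practice, here is a minimal, illustrative sketch using the llama-cpp-python bindings; the GGUF file path is a placeholder, not a model chosen for the course:

```python
from llama_cpp import Llama

# Load an already-quantized GGUF file from disk (path is a placeholder; the file
# would typically be produced with llama.cpp's conversion and quantization tools
# or downloaded from the Hugging Face Hub).
llm = Llama(model_path="./models/model-q4_k_m.gguf", n_ctx=2048)

output = llm("Explain quantization in one sentence:", max_tokens=64)
print(output["choices"][0]["text"])
```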
## CPU Inference (Intel & MLX) | ||
|
||
Finally, we will explore how to perform inference on Intel and MLX (machine learning accelerators for MacOS) CPUs, demonstrating how to leverage local resources for efficient and cost-effective model deployment. Check out the [CPU Inference](./cpu.md) page for more information. | ||
|
||
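As a rough sketch of the Intel path (assuming intel-extension-for-pytorch is installed; the model id is a placeholder), CPU inference can be accelerated along these lines:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M"  # placeholder model id for illustration
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Let IPEX apply CPU-specific optimizations (operator fusion, bf16 kernels, ...).
model = ipex.optimize(model, dtype=torch.bfloat16)
model.eval()

inputs = tokenizer("Quantization makes models", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```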
## Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Quantization with LlamaCPP | Description | Exercise | [link](./notebooks/example.ipynb) | <a target="_blank" href="link"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| CPU Inference (Intel or MLX) | Description | Exercise | [link](./notebooks/example.ipynb) | <a target="_blank" href="link"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
Review comment: This table is sufficient. You can remove the mention of exercise notebooks in the sub-pages and replace it with links.
## References

- [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
- [GGUF Docs](https://huggingface.co/docs/hub/gguf)
- [MLX Docs](https://huggingface.co/docs/hub/mlx)
- [Intel IPEX](https://huggingface.co/docs/accelerate/usage_guides/ipex)
Review comment: If you look at the other modules, you'll see a table with example notebooks. In this module we will need two: one on GGUF and one on CPU inference.