
[luci/pass] Add basic quantization support for weights in GPTQuantizeWeightsWithGPTQPass #14475

Closed
wants to merge 2 commits

Conversation

01000-you (Contributor)

This commit implements basic quantization of weights in QuantizeDequantizeWeightsWithGPTQPass, supporting both 4-bit and 8-bit quantization. Only channel-wise quantization is supported.

ONE-DCO-1.0-Signed-off-by: y01000.you [email protected]


Suggested change:
-quantize = [&](uint32_t *indices, loco::TensorShape &dimension, int index_channel_dim) {
+auto quantize = [&](uint32_t *indices, loco::TensorShape &dimension, int index_channel_dim) {

Comment on lines 196 to 197

Suggested change (remove the declaration):
-IterFunc quantize;

Comment on lines 128 to 129

Suggested change:
-assert(min <= max);
-const int32_t kMinScale = 0;
+assert(min <= max);
+const int32_t kMinScale = 0;

Comment on lines +75 to +76

Suggested change (drop the blank line):
-
-assert(scaling_factor[idx_channel] > 0);
+assert(scaling_factor[idx_channel] > 0);

Comment on lines 79 to 80
auto data_clipped = data < min[idx_channel] ? min[idx_channel] : data;
data_clipped = data_clipped > max[idx_channel] ? max[idx_channel] : data_clipped;
Could this use the min(max(a, b), c) form instead?

@seanshpark (Contributor)

1. Please add a link to the draft.
2. If there are any unit tests, it would be better to add them.
3. Make PRs with small amounts of changes, e.g. one function with several unit tests, including negative cases.

}

Move this formatting change to another PR.

Comment on lines 106 to 107
min[idx_channel] = data < min[idx_channel] ? data : min[idx_channel];
max[idx_channel] = data > max[idx_channel] ? data : max[idx_channel];
Why not use std::min and std::max?

This commit refactors the QuantizeDequantizeWeightsWithGPTQPass.cpp file to improve its readability and maintainability.

ONE-DCO-1.0-Signed-off-by: y01000.you <[email protected]>
@01000-you 01000-you closed this Dec 23, 2024