[luci/pass] Add basic quantization support for weights in QuantizeDequantizeWeightsWithGPTQPass #14475

Closed

01000-you wants to merge 2 commits into Samsung:master from 01000-you:gptq_pass

Contributor

01000-you commented Dec 18, 2024

This commit implements basic quantization of weights in QuantizeDequantizeWeightsWithGPTQPass, supporting both 4-bit and 8-bit quantization. Only channel-wise quantization is supported.

ONE-DCO-1.0-Signed-off-by: y01000.you [email protected]
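For readers unfamiliar with channel-wise weight quantization, the idea in the description above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function name `fake_quantize_channel` is hypothetical, it uses a plain min/max range without the zero-point nudging that the pass appears to perform, and it leaves out the GPTQ error compensation entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of luci): quantize then dequantize the
// weights of ONE channel in place, using that channel's own min/max range.
void fake_quantize_channel(std::vector<float> &channel, int bits)
{
  assert(bits == 4 || bits == 8); // the PR mentions 4-bit and 8-bit only
  if (channel.empty())
    return;

  // Copy the range out before modifying the data.
  const auto range = std::minmax_element(channel.begin(), channel.end());
  const float min = *range.first;
  const float max = *range.second;

  // Number of quantization steps: 15 for 4-bit, 255 for 8-bit.
  const float levels = static_cast<float>((1 << bits) - 1);
  const float scale = (max - min) / levels;
  if (scale <= 0.0f)
    return; // constant channel: nothing to quantize

  for (float &x : channel)
  {
    // Quantize to an integer level, clip to the valid range, then dequantize.
    const float q = std::min(std::max(std::round((x - min) / scale), 0.0f), levels);
    x = q * scale + min;
  }
}
```

Calling this once per output channel gives each channel its own scale, which is what "channel-wise" refers to here.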


[luci/pass] Add basic quantization support for weights in QuantizeDequantizeWeightsWithGPTQPass

2d28d7c

This commit implements basic quantization of weights in `QuantizeDequantizeWeightsWithGPTQPass`, supporting both 4-bit and 8-bit quantization. Only channel-wise quantization is supported.

ONE-DCO-1.0-Signed-off-by: y01000.you <[email protected]>

seanshpark reviewed

compiler/luci/pass/src/QuantizeDequantizeWeightsWithGPTQPass.cpp Outdated

+                                        nudged_max[i]);
+                }
+                quantize = [&](uint32_t *indices, loco::TensorShape &dimension, int index_channel_dim) {

Contributor

seanshpark Dec 18, 2024

Suggested change

      
-                quantize = [&](uint32_t *indices, loco::TensorShape &dimension, int index_channel_dim) {
+                auto quantize = [&](uint32_t *indices, loco::TensorShape &dimension, int index_channel_dim) {
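The suggestion above declares the lambda with `auto` at the point where it is defined, which removes the need for a separate, earlier declaration. A small sketch of that pattern follows. It assumes `IterFunc` is a `std::function` alias; the one-argument signature, `for_each_index`, and `collect_indices` are hypothetical stand-ins and not the luci API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the pass's IterFunc alias (assumed to be a std::function).
using IterFunc = std::function<void(uint32_t)>;

// Hypothetical driver that calls `func` once per index.
void for_each_index(uint32_t count, const IterFunc &func)
{
  assert(func != nullptr); // an empty std::function must not be called
  for (uint32_t i = 0; i < count; ++i)
    func(i);
}

std::vector<uint32_t> collect_indices(uint32_t count)
{
  std::vector<uint32_t> seen;
  // Declared with `auto` where it is defined, so no earlier
  // `IterFunc visit;` declaration is required.
  auto visit = [&](uint32_t i) { seen.push_back(i); };
  for_each_index(count, visit); // the lambda converts to IterFunc at the call
  return seen;
}
```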

seanshpark reviewed

compiler/luci/pass/src/QuantizeDequantizeWeightsWithGPTQPass.cpp Outdated

Comment on lines 196 to 197

		IterFunc quantize;

Contributor

seanshpark Dec 18, 2024

Suggested change

IterFunc quantize;

seanshpark reviewed

compiler/luci/pass/src/QuantizeDequantizeWeightsWithGPTQPass.cpp Outdated

Comment on lines 128 to 129

		assert(min <= max);
		const int32_t kMinScale = 0;

Contributor

seanshpark Dec 18, 2024

Suggested change

      
              assert(min <= max);
          
              const int32_t kMinScale = 0;
          
              assert(min <= max);
          
              const int32_t kMinScale = 0;

seanshpark reviewed

compiler/luci/pass/src/QuantizeDequantizeWeightsWithGPTQPass.cpp

Comment on lines +75 to +76


		assert(scaling_factor[idx_channel] > 0);

Contributor

seanshpark Dec 18, 2024

Suggested change

      
              assert(scaling_factor[idx_channel] > 0);
          
              assert(scaling_factor[idx_channel] > 0);

seanshpark reviewed

compiler/luci/pass/src/QuantizeDequantizeWeightsWithGPTQPass.cpp Outdated

Comment on lines 79 to 80

		auto data_clipped = data < min[idx_channel] ? min[idx_channel] : data;
		data_clipped = data_clipped > max[idx_channel] ? max[idx_channel] : data_clipped;

Contributor

seanshpark Dec 18, 2024

Can't this use the min(max(a, b), c) form?
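The form the reviewer is asking about could look like the sketch below. The helper name `clip_to_range` and its parameters are hypothetical; in the pass itself the bounds would be `min[idx_channel]` and `max[idx_channel]`.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: clip `data` into [lo, hi] with a single expression
// instead of two chained ternaries.
float clip_to_range(float data, float lo, float hi)
{
  assert(lo <= hi);
  return std::min(std::max(data, lo), hi);
}
```

If the project builds as C++17 or later, `std::clamp(data, lo, hi)` from `<algorithm>` expresses the same thing.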

Contributor

seanshpark commented Dec 18, 2024

1/ Please add a link to the draft.
2/ If there are any unit tests, it would be better to add them.
3/ Make PRs with small amounts of changes, like one function with several unit tests, including negative tests.

seanshpark reviewed

compiler/luci/pass/src/QuantizeDequantizeWeightsWithGPTQPass.cpp

		}

Contributor

seanshpark Dec 18, 2024

move this format change to another PR

seanshpark reviewed

compiler/luci/pass/src/QuantizeDequantizeWeightsWithGPTQPass.cpp Outdated

Comment on lines 106 to 107

		min[idx_channel] = data < min[idx_channel] ? data : min[idx_channel];
		max[idx_channel] = data > max[idx_channel] ? data : max[idx_channel];

Contributor

seanshpark Dec 18, 2024

Why not use std::min and std::max?
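As a rough sketch of that suggestion, the two ternaries that track the per-channel range could be written with `std::min` and `std::max`. The function name `update_channel_range` is hypothetical; the variable names follow the quoted lines.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: widen the recorded [min, max] range of one channel
// so that it includes `data`.
void update_channel_range(std::vector<float> &min, std::vector<float> &max,
                          std::size_t idx_channel, float data)
{
  assert(min.size() == max.size());
  assert(idx_channel < min.size());
  min[idx_channel] = std::min(min[idx_channel], data);
  max[idx_channel] = std::max(max[idx_channel], data);
}
```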


          [luci/pass] Refactor QuantizeDequantizeWeightsWithGPTQPass.cpp

dc9ecfe

This commit refactors the QuantizeDequantizeWeightsWithGPTQPass.cpp file to improve its readability and maintainability.

ONE-DCO-1.0-Signed-off-by: y01000.you <[email protected]>

01000-you closed this
