[QNN EP] Make QNN EP a shared library #23120

Open
wants to merge 110 commits into
base: main

@adrianlizarraga (Contributor) commented Dec 16, 2024

Description

  • Makes QNN EP a shared library by default when building with --use_qnn or --use_qnn shared_lib. Generates the following build artifacts:
    • Windows: onnxruntime_providers_qnn.dll and onnxruntime_providers_shared.dll
    • Linux: libonnxruntime_providers_qnn.so and libonnxruntime_providers_shared.so
    • Android: Not supported. Must build QNN EP as a static library.
  • Allows QNN EP to still be built as a static library with --use_qnn static_lib. This is primarily for the Android QNN AAR package.
  • Unit tests run for both the static and shared QNN EP builds.

Detailed changes

  • Updates Java bindings to support both shared and static QNN EP builds.
  • Provider bridge API:
    • Adds logging sink ETW to the provider bridge. Allows EPs to register ETW callbacks for ORT logging.
    • Adds a variety of methods for onnxruntime objects that are needed by QNN EP.
  • QNN EP:
    • Adds ort_api.h and ort_api.cc, which encapsulate the API provided by ORT in a manner that allows the EP to be built as either a shared or static library.
    • Adds a custom function to transpose weights for Conv and Gemm (instead of adding a utility to the provider bridge API).
    • Adds a custom function to quantize data for LeakyRelu (instead of adding a utility to the provider bridge API).
    • Adds custom ETW tracing for QNN profiling events:
      • shared library: defines its own TraceLogging provider handle
      • static library: uses ORT's TraceLogging provider handle and existing telemetry provider.
  • ORT-QNN Packages:
    • Python: Pipelines build QNN EP as a shared library by default. Users can build a local Python wheel with QNN EP as a static library by passing --use_qnn static_lib.
    • NuGet: Pipelines build QNN EP as a shared library by default. build.py currently requires QNN EP to be built as a shared library. Support for building a QNN NuGet package with a static QNN EP can be added later if deemed necessary.
    • Android: Pipelines build QNN EP as a static library. build.py enforces QNN EP to be built as a static library. Packaging multiple shared libraries into an Android AAR package is not currently supported due to the added need to also distribute a shared libcpp.so library.

Motivation and Context

…evert this in favor of doing the transpose manually in QNN EP
…entType(), DataTypeImpl::TensorTypeFromONNXEnum()
…hat does not need to add new functionality to the provider bridge
new_tensor_shape_dims.push_back(tensor_shape_dims[p]);
// Internal function to transpose data of rank 5 with the given permutation.
// Example: transpose input from either (N,C,H,W,D) or (C,N,H,W,D) to (H,W,D,C,N).
static Status TransposeDataRank5(const TensorShape& input_shape,
Contributor

What's the reason to replace the existing TransposeInitializer method? The existing one can handle any rank. Is it because it's re-using CPU EP implementation?

Contributor Author

That's right. We would need to add the existing CPU EP implementation to the provider bridge. Since QNN EP only uses rank-5 and rank-2 transposes, I don't think it is worth the complexity.

Also, looking ahead to the EP-as-plugins project, we want to minimize the API between ORT and EPs.
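
To illustrate the kind of self-contained helper discussed in this thread, here is a minimal sketch of a rank-5 transpose over raw element bytes. It is not the PR's TransposeDataRank5: the name TransposeRank5Bytes, its signature, and the use of std::array and std::vector instead of ORT types are assumptions made for illustration. It assumes contiguous row-major storage and that output dimension i is taken from input dimension perm[i].

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not the PR's implementation): permutes a rank-5 tensor
// stored contiguously in row-major order. Output dimension i is input dimension perm[i].
// Returns false if the arguments are inconsistent.
bool TransposeRank5Bytes(const std::array<size_t, 5>& in_dims,
                         const std::array<size_t, 5>& perm,
                         size_t elem_byte_size,
                         const std::vector<uint8_t>& input,
                         std::vector<uint8_t>& output) {
  if (elem_byte_size == 0) return false;

  // perm must be a permutation of {0, 1, 2, 3, 4}.
  std::array<bool, 5> seen{};
  for (size_t p : perm) {
    if (p >= 5 || seen[p]) return false;
    seen[p] = true;
  }

  size_t num_elems = 1;
  for (size_t d : in_dims) num_elems *= d;
  if (input.size() != num_elems * elem_byte_size) return false;

  // Row-major strides of the input, measured in elements.
  std::array<size_t, 5> in_strides{};
  size_t stride = 1;
  for (size_t i = 5; i-- > 0;) {
    in_strides[i] = stride;
    stride *= in_dims[i];
  }

  std::array<size_t, 5> out_dims{};
  for (size_t i = 0; i < 5; ++i) out_dims[i] = in_dims[perm[i]];

  output.resize(input.size());
  for (size_t dst = 0; dst < num_elems; ++dst) {
    // Decompose the flat output index into per-dimension indices and map each
    // one back to the input stride of the dimension it came from.
    size_t rem = dst;
    size_t src = 0;
    for (size_t i = 5; i-- > 0;) {
      src += (rem % out_dims[i]) * in_strides[perm[i]];
      rem /= out_dims[i];
    }
    assert(src < num_elems);
    std::memcpy(output.data() + dst * elem_byte_size,
                input.data() + src * elem_byte_size,
                elem_byte_size);
  }
  return true;
}
```

Working on bytes with an explicit element size keeps the helper independent of the tensor's data type, at the cost of one small memcpy per element.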

// Copy initializer bytes (stored in little-endian order) to vector of int64_t.
// ReadLittleEndian returns a status error if the source and destination spans do not have
// matching byte sizes.
ORT_RETURN_IF_ERROR(onnxruntime::utils::ReadLittleEndian(src_span, dst_span));
Contributor

I can't remember why we used ReadLittleEndian here. It should be safe at the EP level, right?

Contributor Author

I originally added the use of ReadLittleEndian. It is unnecessary because QNN EP only runs on little-endian architectures. We assume little-endian throughout the code.

We should probably add a check for little-endian in the function that creates a QnnProviderFactory (fail if not little-endian).

Contributor

This should be transparent to EPs. tensorprotoutils in the framework should have covered this already.

Contributor Author

@adrianlizarraga Jan 8, 2025

Yes, that's right. When we call UnpackInitializer(initializer, buffer), tensorprotoutils correctly handles reading the data from the ONNX initializer and storing it in a little-endian byte buffer. However, when we directly copy this buffer of bytes to a gsl::span<int32_t>, we're implicitly assuming that QNN EP is also running on a little-endian machine. This is why I initially added the call to ReadLittleEndian here. However, there are many places in QNN EP where we just reinterpret_cast initializer bytes (little-endian) into a data type like float, which assumes little-endian. So, I'm thinking we just formalize this and say QNN EP currently only runs on little-endian machines.
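
The idea of formalizing the little-endian assumption could look roughly like the sketch below. It is only an illustration under stated assumptions: IsLittleEndianHost and CopyLittleEndianBytesToInt32 are hypothetical names, not functions from ORT or QNN EP, and the PR may implement the check differently (for example, once when the provider factory is created).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Runtime probe: true if the least significant byte of a multi-byte integer
// is stored at the lowest address.
inline bool IsLittleEndianHost() {
  const uint32_t probe = 1;
  uint8_t first_byte = 0;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 1;
}

// Hypothetical helper: copies a little-endian byte buffer into int32_t values.
// Fails instead of silently misreading data if the host is not little-endian
// or the byte count is not a multiple of sizeof(int32_t).
inline bool CopyLittleEndianBytesToInt32(const std::vector<uint8_t>& bytes,
                                         std::vector<int32_t>& values) {
  if (!IsLittleEndianHost()) return false;
  if (bytes.size() % sizeof(int32_t) != 0) return false;

  values.resize(bytes.size() / sizeof(int32_t));
  assert(values.size() * sizeof(int32_t) == bytes.size());
  if (!bytes.empty()) {
    // memcpy avoids the alignment and aliasing pitfalls of reinterpret_cast.
    std::memcpy(values.data(), bytes.data(), bytes.size());
  }
  return true;
}
```

Performing the endianness check once at a single entry point would let the rest of the code keep its plain byte copies without repeating the check.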

std::vector<uint8_t> original_tensor_bytes;
ORT_RETURN_IF_ERROR(qnn_model_wrapper.UnpackInitializerData(*input_info.initializer_tensor, original_tensor_bytes));
unpacked_tensor.resize(original_tensor_bytes.size());
size_t elem_byte_size = qnn::utils::GetElementSizeByType(
Contributor

Need to validate elem_byte_size to make sure it's not 0.

Contributor Author

Updated.
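
A guard of the kind requested here might look like the following sketch. GetElementCount is a hypothetical helper, not code from the PR; the actual update presumably uses ORT's Status-returning macros rather than a bool result.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: computes how many elements fit in a byte buffer.
// Returns false if elem_byte_size is 0 (which would otherwise divide by zero)
// or if the buffer does not hold a whole number of elements.
inline bool GetElementCount(size_t total_bytes, size_t elem_byte_size, size_t& num_elems) {
  if (elem_byte_size == 0) return false;
  if (total_bytes % elem_byte_size != 0) return false;

  assert(elem_byte_size != 0);
  num_elems = total_bytes / elem_byte_size;
  return true;
}
```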

@jywu-msft jywu-msft requested review from jslhcl and 007zszmz January 9, 2025 18:40
HectorSVC previously approved these changes Jan 10, 2025
adrianlizarraga added a commit that referenced this pull request Jan 17, 2025
…23402)

### Description
- Fixes segfault when the function that cleans up HTP memory handles uses an invalid Logger.
- Fixes unit test that compares output from QNN EP with exact float values. QNN HTP runs float32 models with float16 precision, so the comparison needs to use a tolerance.

### Motivation and Context
Fixes issues with using QNN HTP memory sharing on Windows ARM64. This is also needed to test HTP shared memory with #23120.
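
As a rough illustration of comparing outputs with a tolerance instead of exact float equality, consider the sketch below. AllClose is a hypothetical helper and not the test utility used in ORT; the combined absolute and relative tolerance is an assumption, and suitable tolerance values for float16-precision results depend on the model.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true if every actual value is within
// abs_tol + rel_tol * |expected| of the corresponding expected value.
inline bool AllClose(const std::vector<float>& expected,
                     const std::vector<float>& actual,
                     float abs_tol, float rel_tol) {
  assert(abs_tol >= 0.0f && rel_tol >= 0.0f);
  if (expected.size() != actual.size()) return false;

  for (size_t i = 0; i < expected.size(); ++i) {
    const float diff = std::fabs(expected[i] - actual[i]);
    const float limit = abs_tol + rel_tol * std::fabs(expected[i]);
    // Written as !(diff <= limit) so that NaN values fail the comparison.
    if (!(diff <= limit)) return false;
  }
  return true;
}
```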
Labels
ep:QNN issues related to QNN execution provider
5 participants