Commit b2b3f93 (0 parents), committed Feb 28, 2024: 175 changed files with 14,570 additions and 0 deletions.
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 3df2e428938d87c4ad1a69e463cd4a95
tags: 645f666f9bcd5a90fca523b33c5a78b7
C++ API Reference
=================

.. doxygenindex::
   :project: Intel® NPU Acceleration Library
# Developer Guide

Install the developer packages by typing:

```bash
pip install .[dev]
```

It is suggested to install the package locally by using `pip install -e .[dev]` (editable mode).

## Git hooks

All developers should install the git hooks that are tracked in the `.githooks` directory. We use the `pre-commit` framework for hook management; the recommended way of installing it is via pip. Once it is available, enable the hooks with:

```bash
pre-commit install
```

If you want to manually run all pre-commit hooks on a repository, run `pre-commit run --all-files`. To run an individual hook, use `pre-commit run <hook_id>`, as shown below.
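For instance (a minimal sketch: the `flake8` hook id is illustrative only; the actual ids are defined in the repository's pre-commit configuration):

```bash
# Run a single hook across the whole repository; replace "flake8" with a real hook id
pre-commit run flake8 --all-files
```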
Uninstalling the hooks can be done using:

```bash
pre-commit uninstall
```

## Testing the library

### Python tests

The Python tests use the `pytest` library. Type

```bash
cd test/python && pytest
```

to run the full test suite.
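To iterate on a single area, pytest's standard selection flags can be used; a small sketch (the `matmul` keyword is just an illustrative filter, not necessarily a real test name in this repository):

```bash
# Run only the tests whose names match the given keyword, with verbose output
cd test/python && pytest -k "matmul" -v
```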
## Build the documentation

This project uses `sphinx` to build and deploy the documentation. To serve the documentation locally, type

```bash
mkdocs serve
```

To deploy it to GitHub Pages, type

```bash
cd docs
python build_doc.py gh-deploy
```

## Generate Python packages

On Windows:

```bat
python setup.py sdist
set CIBW_BUILD=cp*
cibuildwheel --platform windows --output-dir dist
```

## Publishing packages

Install `twine`:

```bat
python3 -m pip install --upgrade twine
```

Then check that the built sdist and wheel are properly formatted (all files should return a green `PASSED`):

```bat
twine check dist/*
```

Upload the packages to `testpypi`:

```bat
twine upload --repository testpypi dist/*
```

To upload them to the real index (**verify first with testpypi**):

```bat
twine upload dist/*
```
.. Intel® NPU Acceleration Library documentation master file, created by
   sphinx-quickstart on Wed Feb 7 11:48:32 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Intel® NPU Acceleration Library's documentation!
============================================================

The Intel® NPU Acceleration Library is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.

Installation
-------------

Download the ``*.whl`` file relative to your setup from the latest ``intel-npu-acceleration-library`` `release page <https://github.com/intel/intel-npu-acceleration-library/releases/latest>`_. If, for example, you have a ``python 3.9`` installation on an ``x64`` operating system, you should use ``intel_npu_acceleration_library-*-cp39-cp39-win_amd64.whl``.

Once downloaded, you can install it on your machine with

.. code-block:: bash

   pip install intel-npu-acceleration-library
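If you prefer to install the downloaded wheel file directly, pip also accepts the file path; a minimal sketch (the file name below is illustrative and depends on the release version, Python version, and platform):

.. code-block:: bash

   pip install intel_npu_acceleration_library-<version>-cp39-cp39-win_amd64.whl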
Run a LLaMA model on the NPU
----------------------------

To run LLM models you need to install the ``transformers`` library

.. code-block:: bash

   pip install transformers

You are now up and running! You can create a simple script like the following one to run an LLM on the NPU

.. code-block:: python
   :emphasize-lines: 12, 13

   from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
   import intel_npu_acceleration_library
   import torch

   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

   model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
   tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
   tokenizer.pad_token_id = tokenizer.eos_token_id
   streamer = TextStreamer(tokenizer, skip_special_tokens=True)

   print("Compile model for the NPU")
   model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

   query = input("Ask something: ")
   prefix = tokenizer(query, return_tensors="pt")["input_ids"]

   generation_kwargs = dict(
       input_ids=prefix,
       streamer=streamer,
       do_sample=True,
       top_k=50,
       top_p=0.9,
       max_new_tokens=512,
   )

   print("Run inference")
   _ = model.generate(**generation_kwargs)

Take note that you only need to use ``intel_npu_acceleration_library.compile`` to offload the heavy computation to the NPU.
Feel free to check the `Usage <usage.html>`_ and `LLM <llm.html>`_ pages, as well as the `examples <https://github.com/intel/intel-npu-acceleration-library/tree/main/examples>`_ folder, for additional use cases and examples.


Site map
----------------------------

.. toctree::
   :maxdepth: 1
   :caption: Library overview:

   Quickstart <self>
   NPU overview <npu.md>
   usage.md
   setup.md


.. toctree::
   :maxdepth: 1
   :caption: Applications:

   llm.md
   llm_performance.md


.. toctree::
   :maxdepth: 1
   :caption: Development guide:

   developer.md


.. toctree::
   :maxdepth: 1
   :caption: API Reference:

   Python API Reference <python/intel_npu_acceleration_library.rst>
   cpp_reference.rst


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
# Large Language models

## Run an LLM on the NPU

You can use your existing LLM inference script on the NPU with a single line of code:

```python
# First import the library
import intel_npu_acceleration_library

# Call the compile function to offload kernels to the NPU.
model = intel_npu_acceleration_library.compile(model)
```
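If you also want weight quantization, the same `compile` call accepts a `dtype` argument, as used in the quickstart example (int8 shown here):

```python
import torch
import intel_npu_acceleration_library

# Compile and quantize the model weights to 8-bit, as in the quickstart example
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)
```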
Here is a full example:

```python
from torch.profiler import profile, ProfilerActivity
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
from threading import Thread
import intel_npu_acceleration_library
import torch
import time
import sys

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model)

query = "What is the meaning of life?"
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
)

print("Run inference")
_ = model.generate(**generation_kwargs)
```
# Decoding LLM performance

Decoding and understanding the performance of large language models (LLMs) is critical for optimizing their efficiency and effectiveness. The inference process of an LLM can be broken down into three distinct phases, each with its unique characteristics and performance considerations, as shown in the following figure.

![LLM performance](llm_perf.png)

## Load phase

The load phase encompasses the initial steps of bringing an LLM into action, starting from loading the model into memory until the `model.generate()` call is initiated.

### Phase Steps

- Weight loads: load-phase latency is largely determined by how quickly the model weights can be loaded from disk as part of model initialization.
- Quantization: quantization reduces the precision of the weights, which can impact performance. This step is designed to balance the trade-off between model accuracy and computational efficiency. Quantizing the weights involves analyzing the entire model to lower the precision of its parameters; depending on its implementation, it can be an expensive process and might require fine-tuning the model for best performance.
- Compilation: the process of transforming the original model into a format that can run on the NPU. It involves some model optimizations as well as lowering the operations into an NPU runtime format.

### Implications

- CPU/Disk bound: since this phase relies heavily on I/O operations and CPU activity, performance is bound by the underlying CPU and disk speed.
- Pre-compilation: quantizing and, to a lesser extent, compiling a model might introduce significant latency. It is suggested to prepare the model offline rather than at application run time whenever possible; an example of how this can be done is the `export.py` script in the `script` folder. This does not remove the need to load the weights from disk at initialization, but it does remove the compilation and quantization latency. A sketch of how these two costs can be timed follows this list.
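The following is a minimal sketch (not part of the library; the measurement approach is an assumption, and the model id is the one used elsewhere in these docs) that separates weight-load time from quantization-plus-compilation time:

```python
import time

import torch
import intel_npu_acceleration_library
from transformers import AutoModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Weight load: mostly disk/CPU bound
t0 = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
t1 = time.perf_counter()

# Quantization + compilation for the NPU: CPU bound, can be done offline
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)
t2 = time.perf_counter()

print(f"weight load: {t1 - t0:.1f}s, quantize + compile: {t2 - t1:.1f}s")
```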
## Prefill phase

In the prefill phase, the model analyzes the user prompt to produce the initial output token. The primary metric used is `prefill-time` (a.k.a. first inference latency), which gauges the duration from the LLM's initiation to the generation of the first token. This interval is commonly interpreted by users as the "LLM startup time", as it denotes the period from when they start typing to when the LLM begins its response. A short `prefill-time` enhances system responsiveness and user satisfaction.

### Phase Steps

- First inference: the model runs its first inference on the user's prompt. This process can be computationally intensive, particularly with long prompts, as it requires significant matrix-matrix multiplications.
- Key-Value cache (KV-cache): the key and value outputs of every attention layer for the prompt can be cached for the generation of the next tokens, in order to save computation.

### Implications

- Compute bound (NPU): the initial inference is primarily limited by computational resources (NPU) due to the typically substantial size of the user's prompt.
- Input prompt size: the latency of this phase depends on the length of the user's prompt. A longer prompt results in a quadratic increase in runtime due to the LLM's multi-head attention block.

## Token Phase

After the prefill, the LLM enters the token phase, where it generates the remaining tokens of the output sequence. The primary metrics used are `token-time` and `tokens/s`, which measure how quickly the model produces each new token once generation has started (a measurement sketch follows at the end of this section).

### Phase Steps

- Inference: the generated token, alongside the KV-cache, is passed as input to the model. Because of the KV-cache optimization, the required compute is fairly limited, as the LLM effectively runs with a single new token as input.
- Weight loads: while compute is limited, the model still needs to load the entire weight set (potentially billions of parameters) to perform the computation. Therefore, execution is mostly limited by DRAM bandwidth rather than compute capability.

### Implications

- DRAM bandwidth: this stage of the inference is driven significantly by the bandwidth of the DRAM. The rate at which the LLM parameters are transferred from DRAM to the processing units has a considerable effect on the token time.
- Performance factors: although NPU performance still matters, it becomes less of a bottleneck compared to the available DRAM bandwidth.
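The following minimal sketch (an assumed measurement approach, not part of the library) estimates `prefill-time` and `tokens/s` with the same `transformers` APIs used in the other pages; the prompt and token counts are arbitrary:

```python
import time

import intel_npu_acceleration_library
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
model = intel_npu_acceleration_library.compile(model)

input_ids = tokenizer("What is the meaning of life?", return_tensors="pt")["input_ids"]

# Prefill-time: time from the generate() call to the first generated token
t0 = time.perf_counter()
_ = model.generate(input_ids=input_ids, max_new_tokens=1, do_sample=False)
prefill_time = time.perf_counter() - t0

# Token phase: amortized time per token over a longer generation
new_tokens = 128
t0 = time.perf_counter()
_ = model.generate(input_ids=input_ids, max_new_tokens=new_tokens, do_sample=False)
token_time = (time.perf_counter() - t0 - prefill_time) / (new_tokens - 1)

print(f"prefill-time: {prefill_time:.2f}s, ~{1.0 / token_time:.1f} tokens/s")
```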
## System/application parameters

Beyond the phases, certain system parameters significantly influence the performance of LLMs.

- Model architecture and size: the architecture and the size of the model dictate its performance. Larger models, which have more parameters, may provide more accurate results but are also more challenging to fit within the physical memory limits of a system.
- DRAM size and speed: once DRAM is filled, performance can become bottlenecked. If the model and its KV-cache overflow the available DRAM, the system will need to swap memory to disk, leading to much slower inference.
- Prompt length: different applications may require support for varying prompt lengths. Longer prompts translate into larger context sizes, increasing the demand on cache and tensor resources.
- LLM context size: as the context size grows (large prompt and/or a significant number of newly generated tokens) and hits the DRAM limit, performance may again become swap/SSD bound due to insufficient DRAM to contain the larger KV-cache tensors.

# Performance improvement

Increasing the DRAM size/speed: since the token phase is mostly DRAM-bandwidth bound, faster memory directly improves token throughput, while a larger DRAM reduces the risk of swapping the model and KV-cache to disk.

Model quantization: quantization reduces the model footprint and enables faster computations on supported hardware. This is expected to give performance benefits in all inference phases. It is important to note that quantization by itself might reduce model quality, so the accuracy of the quantized LLM should be the target of extensive investigation.

Static shape inference: many AI inference accelerators (Intel NPU, IPU, TPU, etc.) require static shapes to reach maximum performance. Static shapes allow the NN graph compiler to improve memory management, scheduling, and overall network performance. For an example implementation, you can refer to `intel_npu_acceleration_library.nn.llm.generate_with_static_shape` or the `transformers` library's [StaticCache](https://huggingface.co/docs/transformers/v4.38.1/en/internal/generation_utils#transformers.StaticCache). The sketch below illustrates the general idea.
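As a rough conceptual sketch only (this is not the library's implementation; the fixed length and padding strategy are assumptions), the core idea of static shapes is to keep every tensor at a pre-chosen size so the compiled NPU graph never has to change:

```python
import torch

# Assume `tokenizer` and `query` are defined as in the previous examples.
max_len = 512  # fixed context size, chosen ahead of time

input_ids = tokenizer(query, return_tensors="pt")["input_ids"]
n = input_ids.shape[1]

# Pad the prompt to the fixed length and build the matching attention mask,
# so every forward pass sees tensors of shape (1, max_len).
padded_ids = torch.full((1, max_len), tokenizer.pad_token_id, dtype=torch.long)
padded_ids[:, :n] = input_ids
attention_mask = torch.zeros(1, max_len, dtype=torch.long)
attention_mask[:, :n] = 1
```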
## Conclusions

Understanding these phases and system parameters is crucial to diagnose performance bottlenecks, to fairly compare LLM performance across different accelerators, and to develop strategies for optimizing the deployment and execution of LLMs on client and edge platforms. By paying close attention to these aspects, one can ensure that the model operates efficiently, providing quick and accurate responses to user prompts.
# Quick overview of Intel's Neural Processing Unit (NPU)

The Intel NPU is an AI accelerator integrated into Intel Core Ultra processors, characterized by a unique architecture comprising compute acceleration and data transfer capabilities. Its compute acceleration is facilitated by Neural Compute Engines, which consist of hardware acceleration blocks for AI operations like Matrix Multiplication and Convolution, alongside Streaming Hybrid Architecture Vector Engines for general computing tasks.

![Intel NPU architecture](npu_arch.png)

- **Scalable Multi-Tile Design:** The heart of the NPU's compute acceleration capability lies in its scalable tile-based architecture, known as Neural Compute Engines.
- **Hardware Acceleration Blocks:** These engines are equipped with specific hardware blocks designed to handle AI operations that demand high levels of computation, such as Matrix Multiplication and Convolution.
- **Streaming Hybrid Architecture:** Alongside the dedicated AI operation units, the Neural Compute Engines are built with Streaming Hybrid Architecture Vector Engines (SHAVE). This enables them to perform high-performance parallel computing for general compute needs.
- **DMA Engines:** Direct Memory Access (DMA) engines are integral to the NPU, responsible for moving data efficiently between system memory (DRAM) and the software-managed cache.
- **Memory Management:** The incorporation of a built-in device MMU, alongside an IOMMU, allows support for multiple concurrent hardware contexts. This is crucial for maintaining security isolation between these contexts in line with the Microsoft Compute Driver Model (MCDM) architectural standards.

## The Role of Software

While the hardware is undoubtedly advanced, the true "magic" of the Intel NPU is realized through a sophisticated MLIR-based compiler. It is through compiler technology that Intel's NPU reaches its full potential, by optimizing and orchestrating AI workloads.

- **Parallel Workload Execution:** The compiler ensures that AI tasks are executed in parallel, directing both compute and data flows in a tiling pattern with built-in and programmable control flows.
- **Maximizing Compute Utilization:** By prioritizing execution primarily out of scratchpad SRAM and reducing data transfers between SRAM and DRAM, the compiler helps achieve optimum performance-to-power ratios for AI workloads.

Some useful links:

- Intel AI PC ([link](https://www.intel.com/content/www/us/en/products/docs/processors/core-ultra/ai-pc.html?wapkw=NPU))
- Intel Core Ultra Processor line ([link](https://www.intel.com/content/www/us/en/products/docs/processors/core-ultra/core-ultra-series-1-product-brief.html?wapkw=NPU))
- AI Acceleration and NPU explained ([video](https://www.youtube.com/watch?v=QSzNoX0qplE))