From a12e67a1fad55e4144667a53e7ad49c086a0a91c Mon Sep 17 00:00:00 2001
From: binbin Deng <108676127+plusbang@users.noreply.github.com>
Date: Wed, 24 Jul 2024 19:45:08 +0800
Subject: [PATCH] Update subtitle in 6_3_GPU_Whisper-medium

---
 ch_6_GPU_Acceleration/6_3_GPU_Whisper-medium.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/ch_6_GPU_Acceleration/6_3_GPU_Whisper-medium.md b/ch_6_GPU_Acceleration/6_3_GPU_Whisper-medium.md
index c15e259..1ab12bf 100644
--- a/ch_6_GPU_Acceleration/6_3_GPU_Whisper-medium.md
+++ b/ch_6_GPU_Acceleration/6_3_GPU_Whisper-medium.md
@@ -1,4 +1,4 @@
-# 6.2 Run Whisper (medium) on Intel GPUs
+# 6.3 Run Whisper (medium) on Intel GPUs
 
 You can use IPEX-LLM to load Transformer-based automatic speech recognition (ASR) models for acceleration on Intel GPUs. With IPEX-LLM, PyTorch models (in FP16/BF16/FP32) for ASR can be loaded and optimized automatically on Intel GPUs with low-bit quantization (supported precisions include INT4/NF4/INT5/FP6/FP8/INT8).
 
@@ -7,11 +7,11 @@ In this tutorial, you will learn how to run speech models on Intel GPUs with IPE
 > [!NOTE]
 > Please make sure that you have prepared the environment for IPEX-LLM on GPU before you started. Refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html) for more information regarding installation and environment preparation. Besides, to process audio files, you also need to install `librosa` by performing `pip install -U librosa`.
 
-## 6.2.1 Download Audio Files
+## 6.3.1 Download Audio Files
 
 To start with, the first thing to do is preparing some audio files for this demo. As an example, you can download an English example from multilingual audio dataset [voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli) and one Chinese example from the Chinese audio dataset [AIShell](https://huggingface.co/datasets/carlot/AIShell). You are free to pick other recording clips found within or outside the dataset.
 
-## 6.2.2 Load Model in Low Precision
+## 6.3.2 Load Model in Low Precision
 
 One common use case is to load a model from Hugging Face with IPEX-LLM low-bit precision optimization. For Whisper (medium), you could simply import `ipex_llm.transformers.AutoModelForSpeechSeq2Seq` instead of `transformers.AutoModelForSpeechSeq2Seq`, and specify `load_in_4bit=True` parameter accordingly in the `from_pretrained` function.
 
@@ -43,7 +43,7 @@ model_in_4bit_gpu = model_in_4bit.to("xpu")
 >
 > * You could refer to the [API documentation](https://ipex-llm.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html) for more information.
 
-## 6.2.3 Load Whisper Processor
+## 6.3.3 Load Whisper Processor
 
 A Whisper processor is also needed for both audio pre-processing, and post-processing model outputs from tokens to texts. IPEX-LLM does not provide a customized implementation for that, so you might want to use the official `transformers` API to load `WhisperProcessor`:
 
@@ -56,7 +56,7 @@ processor = WhisperProcessor.from_pretrained(pretrained_model_name_or_path="open
 > [!NOTE]
 > If you have already downloaded the Whisper (medium) model, you could specify `pretrained_model_name_or_path` to the model path.
 
-## 6.2.4 Run Model to Transcribe English Audio
+## 6.3.4 Run Model to Transcribe English Audio
 
 Once you have optimized the Whisper model using IPEX-LLM with INT4 optimization and loaded the Whisper processor, you are ready to begin transcribing the audio through model inference.
 
@@ -100,7 +100,7 @@ with torch.inference_mode():
 >
 
-## 6.2.5 Run Model to Transcribe Chinese Audio and Translate to English
+## 6.3.5 Run Model to Transcribe Chinese Audio and Translate to English
 
 Next, let's move to the Chinese audio `audio_zh.wav`, which is randomly taken from the [AIShell](https://huggingface.co/datasets/carlot/AIShell) dataset. Whisper offers capability to transcribe multilingual audio files, and translate the recognized text into English. The only difference here is to define specific context token through `forced_decoder_ids`:
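
For reference, the renumbered section walks readers through loading Whisper (medium) in low precision, loading the `WhisperProcessor`, and transcribing audio. Below is a minimal sketch of that flow, based only on the API names visible in the diff context (`ipex_llm.transformers.AutoModelForSpeechSeq2Seq`, `load_in_4bit=True`, `.to("xpu")`, `WhisperProcessor`, `forced_decoder_ids`); the checkpoint id `openai/whisper-medium`, the file name `audio_en.wav`, and the generation arguments are illustrative assumptions, not part of the patch.

```python
# Minimal sketch (assumed usage): load Whisper (medium) with IPEX-LLM INT4
# optimization, move it to an Intel GPU, and transcribe an English clip.
import librosa
import torch
from ipex_llm.transformers import AutoModelForSpeechSeq2Seq  # drop-in for the transformers class
from transformers import WhisperProcessor

# Load the model with 4-bit (INT4) optimization, then move it to the Intel GPU ("xpu").
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    pretrained_model_name_or_path="openai/whisper-medium",  # or a local model path
    load_in_4bit=True,
)
model = model.to("xpu")

# The processor handles audio pre-processing and token-to-text decoding.
processor = WhisperProcessor.from_pretrained("openai/whisper-medium")

# Whisper expects 16 kHz audio; "audio_en.wav" is a placeholder file name.
data, sample_rate = librosa.load("audio_en.wav", sr=16000)

with torch.inference_mode():
    # Convert the waveform to log-mel input features and move them to the GPU.
    input_features = processor(
        data, sampling_rate=sample_rate, return_tensors="pt"
    ).input_features.to("xpu")

    # Force English transcription; for the Chinese clip, the section switches
    # these decoder prompt ids so the recognized text is translated to English.
    forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="english", task="transcribe"
    )

    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    text = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    print(text)
```

Per the section text, the Chinese example (`audio_zh.wav`) differs only in the `forced_decoder_ids` passed to `generate`, which makes Whisper translate the recognized speech into English instead of transcribing it verbatim.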