
Prompt processing slowing down on Arc GPU #12632

Open
robertvazan opened this issue Dec 29, 2024 · 3 comments
@robertvazan

I am using Ollama from the intelanalytics/ipex-llm-inference-cpp-xpu docker image to run LLMs on an Arc A380 GPU. I am observing a strange issue where the speed of prompt processing drops from 250-300 t/s to 50-100 t/s for no apparent reason. This is measured on a long prompt (so not just noise). Processing of a 15K-token prompt slows down from a minute to several minutes. Once the speed drops, it does not recover on its own. Restarting the container fixes the issue for a while.

There's nothing else running on the GPU (the desktop is on an iGPU). I rarely switch models, and I have observed the slowdown while continuously using the same model. I notice this every couple of days, but it's probably happening more often without me noticing, because I don't always use long prompts. There is nothing unusual in the Ollama log file.

Watching intel_gpu_top, I notice that GPU load goes up and down even when the card performs normally, but when the slowdown happens, the fluctuation has a lower average and the lulls with no GPU activity stretch to several seconds.

The model is fully offloaded to the GPU:

llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  4168.09 MiB
llm_load_tensors:  SYCL_Host buffer size =   292.36 MiB
...
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A380 Graphics|    1.6|    128|    1024|   32|  6064M|            1.3.31294|
llama_kv_cache_init:      SYCL0 KV buffer size =  1344.00 MiB
llama_new_context_with_model: KV self size  = 1344.00 MiB, K (f16):  672.00 MiB, V (f16):  672.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.59 MiB
[1735473133] warming up the model with an empty run
llama_new_context_with_model:      SYCL0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    55.01 MiB

The Docker image configuration was inspired by mattcurf's setup:

FROM intelanalytics/ipex-llm-inference-cpp-xpu:latest
ENV ZES_ENABLE_SYSMAN=1
ENV USE_XETLA=OFF
ENV OLLAMA_HOST=0.0.0.0:11434
RUN mkdir -p /llm/ollama && \
    cd /llm/ollama && \
    init-ollama
WORKDIR /llm/ollama
ENTRYPOINT ["./ollama", "serve"]

Ollama is configured to load only one model at a time with one KV cache and to never time out the current model:

podman create \
    --replace \
    --stop-signal=SIGKILL \
    --name ollama-ipex \
    -p 127.0.0.1:11434:11434 \
    -v ollama-ipex:/root/.ollama \
    -e OLLAMA_DEBUG=1 \
    -e OLLAMA_MAX_LOADED_MODELS=1 \
    -e OLLAMA_NUM_PARALLEL=1 \
    -e OLLAMA_KEEP_ALIVE=-1 \
    --device /dev/dri \
    localhost/ollama-ipex

Other configuration:

  • docker image: intelanalytics/ipex-llm-inference-cpp-xpu:latest pulled on Nov 16, 2024
  • host OS: Fedora 40, kernel 6.12.6-100.fc40.x86_64
  • model: qwen2.5-coder:7b
  • model settings: defaults except context window set to 24*1024 (see the num_ctx sketch after this list)
  • GPU: Arc A380 6GB
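
The context window is the only non-default model setting. For reference, here is roughly how the same 24K window can be passed per request through the Ollama REST API; this is only a sketch, assuming the default port and the model tag above:

import requests

# Single chat request with the 24K context window used in this report.
# The model tag and port are the values from the configuration above.
response = requests.post('http://localhost:11434/api/chat', json={
    'model': 'qwen2.5-coder:7b',
    'options': {'num_ctx': 24 * 1024},
    'stream': False,
    'messages': [{'role': 'user', 'content': 'Hello'}],
})
response.raise_for_status()
print(response.json()['message']['content'])
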
@leonardozcm
Contributor

Hi @robertvazan, I tried it on an A380 and found that the performance has been very stable at around 178 t/s. Further investigation needs to be done.

@robertvazan
Author

Okay, I have bitten the bullet and upgraded to the latest docker image and... it is still slowing down :-( I wrote a benchmark to prove it. I can now demonstrate the slowdown reliably, which makes this a reproducible bug.

Here's the result of a benchmark run:

$ ollama.py benchmark_context
19.3K @ 280 t/s
19.3K @ 251 t/s
19.3K @ 193 t/s
19.3K @ 145 t/s
19.3K @ 117 t/s
19.3K @ 98 t/s
19.3K @ 84 t/s
19.3K @ 73 t/s
^CFinishing current iteration...
19.3K @ 60 t/s
$ podman stop ollama-ipex
$ podman start ollama-ipex
$ ollama.py benchmark_context
19.3K @ 273 t/s
^CFinishing current iteration...
19.3K @ 171 t/s

You can see the speed dropping with every new prompt. Restarting the container recovers the original speed, but it then starts dropping again. The degradation is very quick. For long-context tasks, I essentially have to restart the container after every generation to maintain performance. Restarts, however, kill the entire KV cache, so my cache hit ratio is now 0%.
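
A minimal sketch of that restart-after-every-request workaround, assuming the podman container name used above (the stop/start pair mirrors the commands in the transcript; it of course also throws away the KV cache and forces a model reload):

import subprocess

import requests

def restart_ollama(container='ollama-ipex'):
    # Mirror the podman stop/start from the transcript above. This restores
    # prompt processing speed at the cost of a model reload and a cold KV cache.
    subprocess.run(['podman', 'stop', container], check=True)
    subprocess.run(['podman', 'start', container], check=True)

def chat_once(messages, *, port=11434, model='qwen2.5-coder:latest', window=24*1024):
    # One /api/chat request, then a container restart before the next one.
    response = requests.post(f'http://localhost:{port}/api/chat', json={
        'model': model,
        'options': {'num_ctx': window},
        'stream': False,
        'messages': messages,
    })
    response.raise_for_status()
    restart_ollama()
    return response.json()['message']['content']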

The degradation is also reproducible with a small context window; it just takes more iterations for the speed to drop below 100 t/s:

$ ollama.py benchmark_context --window=4096 --utilization=0.5
2.1K @ 282 t/s
2.1K @ 267 t/s
2.1K @ 239 t/s
2.1K @ 219 t/s
2.1K @ 198 t/s
2.1K @ 187 t/s
2.1K @ 220 t/s
2.1K @ 215 t/s
2.1K @ 173 t/s
2.1K @ 234 t/s
2.1K @ 174 t/s
2.1K @ 195 t/s
2.1K @ 155 t/s
2.1K @ 165 t/s
2.1K @ 131 t/s
2.1K @ 121 t/s
2.1K @ 116 t/s
2.1K @ 115 t/s
2.1K @ 116 t/s
2.1K @ 79 t/s
2.1K @ 75 t/s
2.1K @ 71 t/s
2.1K @ 78 t/s
2.1K @ 92 t/s
2.1K @ 83 t/s
^CFinishing current iteration...
2.1K @ 78 t/s

You can run the benchmark program below to reproduce the issue:

from io import StringIO
import signal
import sys

import fire
import requests

benchmark_terminating = False

def benchmark_context(*, port=11434, window=24*1024, utilization=0.8, model='qwen2.5-coder:latest'):
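    # Fill roughly `utilization` of the context window with a counting
    # sequence, then repeatedly ask the model for the next number. Prompt
    # processing speed is taken from prompt_eval_count / prompt_eval_duration
    # in Ollama's response. Ctrl+C finishes the current iteration cleanly.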
    def signal_handler(sig, frame):
        global benchmark_terminating
        if benchmark_terminating:
            print("Benchmark was terminated mid-iteration.")
            sys.exit(0)
        print("Finishing current iteration...")
        benchmark_terminating = True
    signal.signal(signal.SIGINT, signal_handler)
    counter = 1
    while not benchmark_terminating:
        length = 0
        buffer = StringIO()
        while length < utilization * window:
            line = str(counter) + '\n'
            buffer.write(line)
            length += len(line)
            counter += 1
        messages = [{
            'role': 'user',
            'content': buffer.getvalue() + 'Next?'
        }]
        buffer.close()
        for i in range(5):
            messages.append({
                'role': 'assistant',
                'content': str(counter)
            })
            counter += 1
            messages.append({
                'role': 'user',
                'content': 'Next?'
            })
        http_response = requests.post(f'http://localhost:{port}/api/chat', json={
            'model': model,
            'options': {
                'num_ctx': window
            },
            'stream': False,
            'messages': messages
        })
        http_response.raise_for_status()
        response = http_response.json()
        response_text = response['message']['content']
        correct = str(counter) == response_text
        if not correct:
            print(f'Incorrect: {response_text} (expected {counter})')
        counter += 1
        prompt_tokens = response['prompt_eval_count']
        prompt_nanos = response['prompt_eval_duration']
        print(f'{prompt_tokens/1024:,.1f}K @ {prompt_tokens / (prompt_nanos * 1e-9):,.0f} t/s')

if __name__ == '__main__':
    fire.Fire()
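
The only third-party dependencies are requests and fire; fire is what exposes benchmark_context and its keyword arguments as the command line used above (e.g. --window and --utilization).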

In intel_gpu_top, I can see that as prompt processing speed drops, so does the reported GPU frequency, as if the GPU is not being given enough work. I also noticed that the GPU is far from fully utilized even immediately after a container restart, which is surprising, because prompt processing is an embarrassingly parallel operation. This suggests that the prompt processing code is not fully optimized. Anyway, for now I would be glad if the performance at least didn't drop after every request.

Here's intel_gpu_top output shortly after container restart:

intel-gpu-top: Intel Dg2 (Gen12) @ /dev/dri/card1 - 1496/2397 MHz;  23% RC6;       18 irqs/s

         ENGINES     BUSY                                                                             MI_SEMA MI_WAIT
       Render/3D    0.00% |                                                                         |      0%      0%
         Blitter    0.00% |                                                                         |      0%      0%
           Video    0.00% |                                                                         |      0%      0%
    VideoEnhance    0.00% |                                                                         |      0%      0%
       [unknown]   60.02% |███████████████████████████████████████████▉                             |      0%      0%

   PID              NAME     Render/3D          Blitter            Video          VideoEnhance       [unknown]        
296736   ollama_llama_se |                ||                ||                ||                ||                |

And here's output after a few generations when prompt processing is slow:

intel-gpu-top: Intel Dg2 (Gen12) @ /dev/dri/card1 -  242/ 309 MHz;  84% RC6;        5 irqs/s

         ENGINES     BUSY                                                                             MI_SEMA MI_WAIT
       Render/3D    0.00% |                                                                         |      0%      0%
         Blitter    0.00% |                                                                         |      0%      0%
           Video    0.00% |                                                                         |      0%      0%
    VideoEnhance    0.00% |                                                                         |      0%      0%
       [unknown]   63.90% |██████████████████████████████████████████████▊                          |      0%      0%

   PID              NAME     Render/3D          Blitter            Video          VideoEnhance       [unknown]        
296736   ollama_llama_se |                ||                ||                ||                ||                |

What's not visible in the above output is that the bar for the "[unknown]" engine keeps going up to 100% and then down to 0% as if the GPU is given small batches of work with breaks between them.
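
To correlate these lulls with individual requests, a crude option is to timestamp intel_gpu_top's output while the benchmark runs in another terminal. This is just a sketch; it assumes intel_gpu_top's non-interactive list mode (-l) and usually needs root:

import subprocess
from datetime import datetime

# Prefix each line of intel_gpu_top's text output with a timestamp so that
# GPU-idle periods can be matched against request timing in the Ollama log.
proc = subprocess.Popen(['intel_gpu_top', '-l'], stdout=subprocess.PIPE, text=True)
try:
    for line in proc.stdout:
        print(datetime.now().isoformat(timespec='seconds'), line.rstrip())
except KeyboardInterrupt:
    proc.terminate()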

@leonardozcm
Contributor

Great work, I will give it a try
