Prompt processing slowing down on Arc GPU #12632
Hi @robertvazan, I tried it on an A380 and found that performance has been very stable at around 178 t/s. Further investigation is needed.
Okay, I have bitten the bullet and upgraded to the latest docker image and... it is still slowing down :-( I wrote a benchmark to prove it. I can now demonstrate the slowdown reliably, which makes this into a reproducible bug. Here's the result of a benchmark run:
You can see the speed dropping with every new prompt. A container restart recovers the original speed, but it then starts dropping again. The degradation is very quick: for long-context tasks, I essentially have to restart the container after every generation to maintain performance. Restarts, however, kill the entire KV cache, so my cache hit ratio is now 0%. The degradation is also reproducible with a small context window; it just takes more iterations for the speed to drop below 100 t/s:
You can run the benchmark program below to reproduce the issue:

from io import StringIO
from datetime import datetime, UTC
import requests
import json
import sys
import signal
import fire  # python-fire provides the command-line entry point used below

benchmark_terminating = False

def benchmark_context(*, port=11434, window=24*1024, utilization=0.8, model='qwen2.5-coder:latest'):
    # Finish the current iteration cleanly on Ctrl+C; a second Ctrl+C exits immediately.
    def signal_handler(sig, frame):
        global benchmark_terminating
        if benchmark_terminating:
            print("Benchmark was terminated mid-iteration.")
            sys.exit(0)
        print("Finishing current iteration...")
        benchmark_terminating = True
    signal.signal(signal.SIGINT, signal_handler)
    counter = 1
    while not benchmark_terminating:
        # Fill most of the context window with an incrementing number sequence.
        length = 0
        buffer = StringIO()
        while length < utilization * window:
            line = str(counter) + '\n'
            buffer.write(line)
            length += len(line)
            counter += 1
        messages = [{
            'role': 'user',
            'content': buffer.getvalue() + 'Next?'
        }]
        buffer.close()
        # Append a few fabricated assistant/user turns so the prompt looks like a chat history.
        for i in range(5):
            messages.append({
                'role': 'assistant',
                'content': str(counter)
            })
            counter += 1
            messages.append({
                'role': 'user',
                'content': 'Next?'
            })
        http_response = requests.post(f'http://localhost:{port}/api/chat', json={
            'model': model,
            'options': {
                'num_ctx': window
            },
            'stream': False,
            'messages': messages
        })
        http_response.raise_for_status()
        response = http_response.json()
        response_text = response['message']['content']
        correct = str(counter) == response_text
        if not correct:
            print(f'Incorrect: {response_text} (expected {counter})')
        counter += 1
        # Report prompt processing speed from Ollama's token and duration counters.
        prompt_tokens = response['prompt_eval_count']
        prompt_nanos = response['prompt_eval_duration']
        print(f'{prompt_tokens/1024:,.1f}K @ {prompt_tokens / (prompt_nanos * 1e-9):,.0f} t/s')

if __name__ == '__main__':
    fire.Fire()
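Since the script exposes benchmark_context through python-fire, a run looks roughly like this, assuming it is saved as benchmark.py (the file name is arbitrary) and the requests and fire packages are installed:

python benchmark.py benchmark_context --window=8192 --model=qwen2.5-coder:7b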
Here's output when prompt processing is fast, and here's output after a few generations when prompt processing is slow:
What's not visible in the above output is that the bar for the "[unknown]" engine keeps going up to 100% and then down to 0%, as if the GPU is given small batches of work with breaks between them.
Great work, I will give it a try.
I am using Ollama from the intelanalytics/ipex-llm-inference-cpp-xpu docker image to run LLMs on an Arc A380 GPU. I am observing a strange issue where prompt processing speed drops from 250-300 t/s to 50-100 t/s for no apparent reason. This is measured on a long prompt (so not just noise). Processing of a 15K-token prompt slows down from a minute to several minutes. Once the speed drops, it does not recover on its own. Restarting the container fixes the issue for a while.

There's nothing else running on the GPU (the desktop is on an iGPU). I rarely switch models, and I have observed the slowdown while continuously using the same model. I notice this every couple of days, but it's probably happening more often without me noticing, because I don't always use long prompts. There is nothing unusual in the Ollama log file.
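The speeds quoted here are computed from Ollama's prompt_eval_count and prompt_eval_duration counters, the same fields the benchmark script earlier in this thread uses. A minimal sketch, assuming an Ollama server on localhost:11434 with qwen2.5-coder:7b loaded:

import requests

# Send a trivial chat request and report prompt-processing speed.
r = requests.post('http://localhost:11434/api/chat', json={
    'model': 'qwen2.5-coder:7b',
    'stream': False,
    'messages': [{'role': 'user', 'content': 'Hello'}]
})
r.raise_for_status()
response = r.json()
tokens = response['prompt_eval_count']    # prompt tokens processed
nanos = response['prompt_eval_duration']  # prompt processing time in nanoseconds
print(f'{tokens / (nanos * 1e-9):,.0f} t/s')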
Watching intel_gpu_top, I notice that GPU load goes up and down even when the card performs normally, but when this slowdown happens, the up-and-down fluctuation in GPU load has a lower average and the lulls with no GPU activity last up to several seconds.

The model is fully offloaded to the GPU:
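(The confirming output isn't included above; a typical way to verify this is running ollama ps inside the container, which reports whether the loaded model sits 100% on the GPU or is split with the CPU.)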
Docker image configuration was inspired by mattcurf's setup:
Ollama is configured to load only one model at a time with one KV cache and to never timeout the current model:
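The exact settings are not shown above, but an Ollama setup matching that description would typically be expressed through environment variables along these lines (values assumed, not taken from the original config):

OLLAMA_MAX_LOADED_MODELS=1   # load only one model at a time
OLLAMA_NUM_PARALLEL=1        # a single request slot, i.e. one KV cache
OLLAMA_KEEP_ALIVE=-1         # never unload the current model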
Other configuration:
intelanalytics/ipex-llm-inference-cpp-xpu:latest, pulled on Nov 16, 2024
qwen2.5-coder:7b