
High latency on a L4 GPU #229

Open
1 task done
2010b9 opened this issue Feb 20, 2025 · 2 comments
Labels
question Further information is requested

Comments

@2010b9

2010b9 commented Feb 20, 2025

Due diligence

  • I have done my due diligence in trying to find the answer myself.

Topic

The PyTorch implementation

Question

Hello!

First of all, congrats! I've been doing some research on open-source speech-to-speech models, and yours is by far the most natural one. I'm really excited to see your upcoming developments!

My question is about the high latency I'm experiencing when I start the server with python -m moshi.server on a GCP VM instance with an L4 GPU. In the README.md, you state that Moshi achieves a theoretical latency of 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms on an **L4 GPU**.
As you can see in the image below, I'm experiencing latencies of up to 11s. The latency starts to increase as the conversation progresses, and it reached 11s at about 1min 42s into the conversation.
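For context, the README's latency figure quoted above is simple arithmetic over the two delays; a minimal sketch (the 80ms figures and the 200ms practical number come from the issue text, the variable names are mine):

```python
# Latency budget described in the README: two fixed delays stack up.
MIMI_FRAME_MS = 80      # Mimi encodes audio in 80 ms frames
ACOUSTIC_DELAY_MS = 80  # delay between semantic and acoustic tokens

theoretical_ms = MIMI_FRAME_MS + ACOUSTIC_DELAY_MS
print(theoretical_ms)  # 160

# The quoted practical figure on an L4 GPU leaves ~40 ms for compute/network.
practical_ms = 200
overhead_ms = practical_ms - theoretical_ms
print(overhead_ms)  # 40
```

Any latency beyond that budget has to come from compute falling behind real time or from the network path.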

Image

Do you know what I'm doing wrong?

Note: I'm still a noob in these topics, but very excited and eager to learn!

Thank you in advance!

@2010b9 2010b9 added the question Further information is requested label Feb 20, 2025
@LaurentMazare
Member

That's unexpected; our web infra moshi.chat runs on L4 GPUs and has been working well. I assume the model is properly running on the GPU, as otherwise the stats would be even worse. Any chance that something else is running on the server? You may want to try out scripts/moshi_benchmark.py to get some stats for the case where everything takes place locally; this will help determine whether the model is failing to run in real time or whether the problem is network hiccups.
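To make the "running in real time" distinction concrete: a streaming model must produce each 80ms frame in under 80ms of wall-clock time, otherwise lag accumulates frame after frame. A small illustrative sketch (these helper functions and the 88ms step time are hypothetical, not from moshi's code), showing how even a small per-step deficit compounds into seconds of lag over a conversation:

```python
def realtime_factor(step_ms: float, frame_ms: float = 80.0) -> float:
    """How many times faster than real time a step runs (>= 1.0 is OK)."""
    return frame_ms / step_ms

def lag_after(n_steps: int, step_ms: float, frame_ms: float = 80.0) -> float:
    """Accumulated lag in ms if every step takes step_ms instead of frame_ms."""
    return max(0.0, (step_ms - frame_ms) * n_steps)

# Suppose each step takes 88 ms instead of the 80 ms budget: the model is
# only ~0.91x real time, and lag grows ~8 ms per frame. Over 102 s of
# conversation (1275 frames) that compounds to ~10.2 s of lag, which would
# match latency that climbs steadily as the conversation progresses.
print(realtime_factor(88.0))   # ~0.909
print(lag_after(1275, 88.0))   # 10200.0
```

This is why a local benchmark is the right first step: it isolates per-step compute time from any network effects.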

@2010b9
Author

2010b9 commented Feb 24, 2025

It throws the error below when I run python3 scripts/moshi_benchmark.py. I need to check why that is happening.

loading mimi
mimi loaded
loading moshi
Traceback (most recent call last):
  File "/home/brunovaz/speech-lms/moshi/scripts/moshi_benchmark.py", line 66, in <module>
    lm = loaders.get_moshi_lm(args.moshi_weight, args.device)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brunovaz/speech-lms/venv/lib/python3.11/site-packages/moshi/models/loaders.py", line 285, in get_moshi_lm
    model = LMModel(
            ^^^^^^^^
TypeError: moshi.models.lm.LMModel() argument after ** must be a mapping, not str
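That TypeError means LMModel(**config) received a string where a mapping was expected; Python's double-star unpacking only accepts a mapping (e.g. a dict), so this typically points to a version mismatch where the loader was handed something like a config file path instead of a parsed config dict. A minimal reproduction of the underlying Python behavior (the function and values are illustrative, not moshi's actual API):

```python
# Double-star unpacking requires a mapping; passing a str raises the same
# TypeError seen in the traceback above.
def make_model(**kwargs):
    return kwargs

config_dict = {"dim": 4096, "num_layers": 32}  # illustrative values
assert make_model(**config_dict) == config_dict  # works with a dict

try:
    make_model(**"config.json")  # a path string instead of a parsed dict
except TypeError as e:
    print(type(e).__name__)  # TypeError
```

Upgrading the moshi package and the checked-out scripts to matching versions is usually the fix for this class of error.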

Nonetheless, if I use the Rust backend with cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config.json standalone, the latencies on the L4 GPU are around 300ms to 500ms.
