
High latency on a L4 GPU #229

Open
1 task done
2010b9 opened this issue Feb 20, 2025 · 2 comments
Labels
question Further information is requested

Comments

@2010b9

2010b9 commented Feb 20, 2025

Due diligence

  • I have done my due diligence in trying to find the answer myself.

Topic

The PyTorch implementation

Question

Hello!

First of all, congrats! I've been doing some research on open-source speech-to-speech models, and yours is by far the most natural one. I'm really excited to see your upcoming developments!

My question is about the high latency I'm experiencing when I start the server with python -m moshi.server on a GCP VM instance with an L4 GPU. In the README.md, you state that Moshi achieves a theoretical latency of 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms on an **L4 GPU**.
As you can see in the image below, I'm experiencing latencies of up to 11s. The latency starts to increase as the conversation progresses, and it reached 11s at about 1min 42s into the conversation.
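For context, the README's latency figure quoted above is simple arithmetic over the two delays; a minimal sketch (the 80ms figures and the 200ms practical number come from the issue text, the variable names are mine):

```python
# Latency budget described in the README: two fixed delays stack up.
MIMI_FRAME_MS = 80      # Mimi encodes audio in 80 ms frames
ACOUSTIC_DELAY_MS = 80  # delay between semantic and acoustic tokens

theoretical_ms = MIMI_FRAME_MS + ACOUSTIC_DELAY_MS
print(theoretical_ms)  # 160

# The quoted practical figure on an L4 GPU leaves ~40 ms for compute/network.
practical_ms = 200
overhead_ms = practical_ms - theoretical_ms
print(overhead_ms)  # 40
```

Any latency beyond that budget has to come from compute falling behind real time or from the network path.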

Image

Do you know what I'm doing wrong?

Note: I'm still a noob in these topics, but very excited and eager to learn!

Thank you in advance!

@2010b9 2010b9 added the question Further information is requested label Feb 20, 2025
@LaurentMazare
Member

That's unexpected; our web infra moshi.chat runs on L4 GPUs and has been working well. I assume the model is properly running on the GPU, as otherwise the stats would be even worse. Any chance that something else is running on the server? You may want to try out scripts/moshi_benchmark.py to get some stats for the case where everything takes place locally; this will help determine whether the model is failing to run in real time or whether the problem is network hiccups.
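To make the "running in real time" distinction concrete: a streaming model must produce each 80ms frame in under 80ms of wall-clock time, otherwise lag accumulates frame after frame. A small illustrative sketch (these helper functions and the 88ms step time are hypothetical, not from moshi's code), showing how even a small per-step deficit compounds into seconds of lag over a conversation:

```python
def realtime_factor(step_ms: float, frame_ms: float = 80.0) -> float:
    """How many times faster than real time a step runs (>= 1.0 is OK)."""
    return frame_ms / step_ms

def lag_after(n_steps: int, step_ms: float, frame_ms: float = 80.0) -> float:
    """Accumulated lag in ms if every step takes step_ms instead of frame_ms."""
    return max(0.0, (step_ms - frame_ms) * n_steps)

# Suppose each step takes 88 ms instead of the 80 ms budget: the model is
# only ~0.91x real time, and lag grows ~8 ms per frame. Over 102 s of
# conversation (1275 frames) that compounds to ~10.2 s of lag, which would
# match latency that climbs steadily as the conversation progresses.
print(realtime_factor(88.0))   # ~0.909
print(lag_after(1275, 88.0))   # 10200.0
```

This is why a local benchmark is the right first step: it isolates per-step compute time from any network effects.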

@2010b9
Author

2010b9 commented Feb 24, 2025

It throws the error below when I run python3 scripts/moshi_benchmark.py. I need to check why that is happening.

loading mimi
mimi loaded
loading moshi
Traceback (most recent call last):
  File "/home/brunovaz/speech-lms/moshi/scripts/moshi_benchmark.py", line 66, in <module>
    lm = loaders.get_moshi_lm(args.moshi_weight, args.device)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brunovaz/speech-lms/venv/lib/python3.11/site-packages/moshi/models/loaders.py", line 285, in get_moshi_lm
    model = LMModel(
            ^^^^^^^^
TypeError: moshi.models.lm.LMModel() argument after ** must be a mapping, not str
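That TypeError means LMModel(**config) received a string where a mapping was expected; Python's double-star unpacking only accepts a mapping (e.g. a dict), so this typically points to a version mismatch where the loader was handed something like a config file path instead of a parsed config dict. A minimal reproduction of the underlying Python behavior (the function and values are illustrative, not moshi's actual API):

```python
# Double-star unpacking requires a mapping; passing a str raises the same
# TypeError seen in the traceback above.
def make_model(**kwargs):
    return kwargs

config_dict = {"dim": 4096, "num_layers": 32}  # illustrative values
assert make_model(**config_dict) == config_dict  # works with a dict

try:
    make_model(**"config.json")  # a path string instead of a parsed dict
except TypeError as e:
    print(type(e).__name__)  # TypeError
```

Upgrading the moshi package and the checked-out scripts to matching versions is usually the fix for this class of error.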

Nonetheless, if I use the Rust backend with cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config.json standalone, the latencies on the L4 GPU are around 300ms to 500ms.
