Slow in computing features for .m4a files #1460

Closed
csukuangfj opened this issue Mar 5, 2025 · 2 comments

@csukuangfj
Contributor

We have many *.m4a files, where each file is several hours long and contains multiple supervision segments.

We have used

def trim_to_supervisions(

to split a long cutset into smaller segments.

Computing features for each segment turned out to be very slow. After some debugging, we found the following:

Image

Looks like it uses

ffmpeg -i xxx.m4a  -f s16le -

to read the whole file xxx.m4a, which is not expected since we only need to extract a segment.

Also, our *.m4a files usually contain two channels. We would like it to read only the required channel from the file, rather than reading all channels and discarding the unused ones afterwards.
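To make the request concrete, here is a minimal sketch of an ffmpeg invocation that seeks to a segment and keeps a single channel. The helper names `build_ffmpeg_segment_cmd` and `read_segment_bytes` are hypothetical and are not part of lhotse. The sketch assumes `ffmpeg` is on `PATH`. Note that the `pan` filter selects the channel inside ffmpeg, so only one channel is written to the pipe, but the decoder itself still decodes the full stream for the requested segment.

```python
import subprocess
from typing import List


def build_ffmpeg_segment_cmd(
    path: str, offset: float, duration: float, channel: int = 0
) -> List[str]:
    """Build an ffmpeg command that writes one channel of one segment as raw PCM.

    Hypothetical helper for illustration; not lhotse's implementation.
    """
    return [
        "ffmpeg",
        "-hide_banner",
        "-loglevel", "error",
        "-ss", str(offset),    # before -i: seek in the input file
        "-t", str(duration),   # before -i: limit how much input is read
        "-i", path,
        "-af", f"pan=mono|c0=c{channel}",  # keep only the requested channel
        "-f", "s16le",         # raw signed 16-bit little-endian samples
        "-",                   # write to stdout
    ]


def read_segment_bytes(
    path: str, offset: float, duration: float, channel: int = 0
) -> bytes:
    """Run ffmpeg and return the raw PCM bytes of the requested segment."""
    cmd = build_ffmpeg_segment_cmd(path, offset, duration, channel)
    return subprocess.run(cmd, check=True, stdout=subprocess.PIPE).stdout
```

For example, `read_segment_bytes("a.m4a", offset=200, duration=100)` would return the first channel of the 100-second segment at the file's native sampling rate.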


To help reproduce, we have prepared a colab notebook at
https://colab.research.google.com/drive/1ZU2M_z9kY503UKPYMuYX0BQ5BLuVi6jt?usp=sharing

sox -r 16000 -n -b 16 -c 1 a.wav synth 4320 sin 1000
ffmpeg -i a.wav a.m4a

This first generates an m4a file that is 4320 seconds long.

import time
import lhotse

recording = lhotse.Recording.from_file('a.m4a')

Then we construct a recording from it.

## Test loading the whole file
import time
for i in range(10):
  start = time.time()
  audio = recording.load_audio()
  end = time.time()
  print(f'Iter {i}: {end-start} s')

The output is given below

Iter 0: 2.1797096729278564 s
Iter 1: 2.187424659729004 s
Iter 2: 1.7927534580230713 s
Iter 3: 2.0300517082214355 s
Iter 4: 2.191190242767334 s
Iter 5: 1.7366209030151367 s
Iter 6: 1.7066655158996582 s
Iter 7: 1.7146885395050049 s
Iter 8: 1.7510318756103516 s
Iter 9: 1.6823420524597168 s

You can see that it takes about 2 seconds to load the whole file (4320 seconds).

## Test loading 100 seconds
import time
for i in range(10):
  start = time.time()
  audio = recording.load_audio(offset=200, duration=100)
  end = time.time()
  print(f'Iter {i}: {end-start} s')

The output is given below

Iter 0: 1.2440290451049805 s
Iter 1: 1.5040011405944824 s
Iter 2: 1.0565969944000244 s
Iter 3: 1.477067232131958 s
Iter 4: 1.03183913230896 s
Iter 5: 1.0384526252746582 s
Iter 6: 0.9423067569732666 s
Iter 7: 1.135765790939331 s
Iter 8: 1.038952350616455 s
Iter 9: 0.9517149925231934 s

You can see that loading only 100 seconds of the file still takes roughly 0.9 to 1.5 seconds, which is comparable to loading the whole file.

## Test loading 5 seconds
import time
for i in range(10):
  start = time.time()
  audio = recording.load_audio(offset=100, duration=5)
  end = time.time()
  print(f'Iter {i}: {end-start} s')

The output is

Iter 0: 1.0234184265136719 s
Iter 1: 1.2550442218780518 s
Iter 2: 1.5072486400604248 s
Iter 3: 1.1016292572021484 s
Iter 4: 1.038039207458496 s
Iter 5: 0.9318227767944336 s
Iter 6: 1.1400001049041748 s
Iter 7: 1.0508427619934082 s
Iter 8: 1.0651192665100098 s
Iter 9: 1.0305383205413818 s

Even worse, reading only 5 seconds of the file takes just as long, about 1 second.
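The three timing loops above follow the same pattern, so they can be folded into one small helper. This is only a sketch: `time_calls` and `summarize` are hypothetical names, and `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring short intervals.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(fn: Callable[[], object], repeats: int = 10) -> List[float]:
    """Call fn() `repeats` times and return the wall-clock time of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Reduce a list of timings to min / median / max, in seconds."""
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

With the recording from above, `summarize(time_calls(lambda: recording.load_audio(offset=100, duration=5)))` gives one summary line per configuration instead of ten.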


If we use ffmpeg directly to read part of a file

for i in $(seq 10); do
  time ffmpeg -hide_banner -loglevel error -threads 1 -ss 200 -t 100 -i a.m4a -threads 1 100s.wav
  rm -f 100s.wav
done

The output is

real	0m0.172s
user	0m0.140s
sys	0m0.108s

real	0m0.171s
user	0m0.137s
sys	0m0.099s

real	0m0.175s
user	0m0.149s
sys	0m0.099s

real	0m0.169s
user	0m0.143s
sys	0m0.100s

real	0m0.173s
user	0m0.140s
sys	0m0.101s

real	0m0.200s
user	0m0.159s
sys	0m0.099s

real	0m0.168s
user	0m0.138s
sys	0m0.098s

real	0m0.172s
user	0m0.143s
sys	0m0.098s

real	0m0.163s
user	0m0.135s
sys	0m0.097s

real	0m0.172s
user	0m0.140s
sys	0m0.100s

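You can see it is much faster: each run takes about 0.17 s of wall-clock time, compared with roughly 1 s for the equivalent load_audio call.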


This commit
csukuangfj@2dd8125
changes the current audio backend to use the ffmpeg command line, instead of torchaudio.load(), to handle m4a files.

With the above commit, htop shows the following:

Image

The RTF on CPU for extracting features is about 0.0003-0.0008.
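For readers unfamiliar with the term, RTF (real-time factor) is commonly defined as processing time divided by audio duration, and I am assuming that is the definition used here. A small sketch, using the whole-file load time from earlier in this thread (about 2 s for 4320 s of audio) as the example input. That number covers loading only, not feature extraction, so it is not the RTF reported above.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Loading the whole 4320-second file took about 2 seconds.
rtf = real_time_factor(2.0, 4320.0)
print(f"{rtf:.6f}")  # prints 0.000463
```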

@pzelasko
Collaborator

pzelasko commented Mar 5, 2025

Thanks for the detailed description. I managed to reproduce your issue. You can select a different existing audio backend that is more performant for m4a.

I got the best result with this:

with lhotse.audio_backend("FfmpegTorchaudioStreamerBackend"):
  for i in range(10):
    start = time.time()
    audio = recording.load_audio(offset=100, duration=5)
    end = time.time()
    print(f'Iter {i}: {end-start} s')

Output:

Iter 0: 0.014747142791748047 s
Iter 1: 0.011191129684448242 s
Iter 2: 0.009851455688476562 s
Iter 3: 0.00884866714477539 s
Iter 4: 0.008603811264038086 s
Iter 5: 0.008523941040039062 s
Iter 6: 0.008649110794067383 s
Iter 7: 0.009306192398071289 s
Iter 8: 0.009340286254882812 s
Iter 9: 0.008769750595092773 s

The second best was:

with lhotse.audio_backend("AudioreadBackend"):
  for i in range(10):
    start = time.time()
    audio = recording.load_audio(offset=100, duration=5)
    end = time.time()
    print(f'Iter {i}: {end-start} s')

Output:

Iter 0: 0.24323129653930664 s
Iter 1: 0.20929312705993652 s
Iter 2: 0.23014044761657715 s
Iter 3: 0.2314000129699707 s
Iter 4: 0.21734905242919922 s
Iter 5: 0.1870877742767334 s
Iter 6: 0.19929170608520508 s
Iter 7: 0.22307038307189941 s
Iter 8: 0.2071971893310547 s
Iter 9: 0.20948362350463867 s

It's also possible to globally override the audio backend with lhotse.set_audio_backend or with LHOTSE_AUDIO_BACKEND env var (see this readme section).
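A minimal sketch of the environment-variable route, for scripts where wrapping every call in the context manager is inconvenient. It assumes lhotse reads `LHOTSE_AUDIO_BACKEND` when it first resolves its backend, so the variable is set before the import to be safe. I have not checked exactly when lhotse reads it.

```python
import os

# Set the variable before importing lhotse so it is visible whenever
# lhotse first picks its audio backend (the exact timing is an assumption).
os.environ["LHOTSE_AUDIO_BACKEND"] = "FfmpegTorchaudioStreamerBackend"

# import lhotse  # import only after the variable has been set
# recording = lhotse.Recording.from_file("a.m4a")
# audio = recording.load_audio(offset=100, duration=5)
```

Exporting the variable in the shell before launching the script has the same effect and avoids touching the code.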

@csukuangfj
Contributor Author

Thank you for the suggestion! Yours is much better. Closing.
