Slow in computing features for .m4a files #1460

csukuangfj · 2025-03-05T10:42:52Z

We have many *.m4a files, where each file is several hours long and contains multiple supervision segments.

We have used trim_to_supervisions() (lhotse/lhotse/cut/set.py, line 1002 in acbca24) to split a long cutset into smaller segments.

While computing features for each segment, we found that it is very slow. After some debugging, it looks like it uses

ffmpeg -i xxx.m4a  -f s16le -

to read the whole file xxx.m4a, which is not expected since we only need to extract a segment.

Also, our *.m4a files usually contain two channels. We hope that it reads only one channel from the file instead of all channels and discards unused channel(s) afterwards.
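As a rough illustration of the behaviour requested above, the sketch below builds an ffmpeg command that seeks to a segment and keeps a single channel. The helper name build_ffmpeg_segment_cmd and its parameters are hypothetical and not part of lhotse; the flags (-ss, -t, -af pan, -ar, -f s16le) are standard ffmpeg options. The command is passed to subprocess as a list, so the | in the pan filter needs no shell quoting.

```python
from typing import List, Optional


def build_ffmpeg_segment_cmd(
    path: str,
    offset: float,
    duration: float,
    channel: Optional[int] = None,
    sampling_rate: Optional[int] = None,
) -> List[str]:
    """Build an ffmpeg command that outputs only [offset, offset + duration).

    Hypothetical helper for illustration; it is not part of lhotse.
    """
    cmd = [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        # -ss and -t placed before -i apply to the input, so ffmpeg seeks
        # to the segment instead of reading the file from the start.
        "-ss", str(offset),
        "-t", str(duration),
        "-i", path,
    ]
    if channel is not None:
        # The pan filter keeps one input channel and emits it as mono.
        cmd += ["-af", f"pan=mono|c0=c{channel}"]
    if sampling_rate is not None:
        cmd += ["-ar", str(sampling_rate)]
    # Raw signed 16-bit little-endian PCM written to stdout.
    cmd += ["-f", "s16le", "-"]
    return cmd


print(" ".join(build_ffmpeg_segment_cmd("a.m4a", offset=200, duration=100, channel=0)))
```

Passing the returned list to subprocess.run(cmd, stdout=subprocess.PIPE) would yield the raw PCM bytes of the selected channel only.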

To help reproduce, we have prepared a colab notebook at
https://colab.research.google.com/drive/1ZU2M_z9kY503UKPYMuYX0BQ5BLuVi6jt?usp=sharing

sox -r 16000 -n -b 16 -c 1 a.wav synth 4320 sin 1000
ffmpeg -i a.wav a.m4a

It first generates an m4a file that is 4320 seconds long.

import time
import lhotse

recording = lhotse.Recording.from_file('a.m4a')

Then we construct a recording from it.

## Test loading the whole file
import time
for i in range(10):
  start = time.time()
  audio = recording.load_audio()
  end = time.time()
  print(f'Iter {i}: {end-start} s')

The output is given below

Iter 0: 2.1797096729278564 s
Iter 1: 2.187424659729004 s
Iter 2: 1.7927534580230713 s
Iter 3: 2.0300517082214355 s
Iter 4: 2.191190242767334 s
Iter 5: 1.7366209030151367 s
Iter 6: 1.7066655158996582 s
Iter 7: 1.7146885395050049 s
Iter 8: 1.7510318756103516 s
Iter 9: 1.6823420524597168 s

You can see that it takes about 2 seconds to load the whole file (4320 seconds).

## Test loading 100 seconds
import time
for i in range(10):
  start = time.time()
  audio = recording.load_audio(offset=200, duration=100)
  end = time.time()
  print(f'Iter {i}: {end-start} s')

The output is given below

Iter 0: 1.2440290451049805 s
Iter 1: 1.5040011405944824 s
Iter 2: 1.0565969944000244 s
Iter 3: 1.477067232131958 s
Iter 4: 1.03183913230896 s
Iter 5: 1.0384526252746582 s
Iter 6: 0.9423067569732666 s
Iter 7: 1.135765790939331 s
Iter 8: 1.038952350616455 s
Iter 9: 0.9517149925231934 s

You can see that loading only 100 seconds of the file takes 1.0 to 1.5 seconds, which is comparable to loading the whole file.

## Test loading 5 seconds
import time
for i in range(10):
  start = time.time()
  audio = recording.load_audio(offset=100, duration=5)
  end = time.time()
  print(f'Iter {i}: {end-start} s')

The output is

Iter 0: 1.0234184265136719 s
Iter 1: 1.2550442218780518 s
Iter 2: 1.5072486400604248 s
Iter 3: 1.1016292572021484 s
Iter 4: 1.038039207458496 s
Iter 5: 0.9318227767944336 s
Iter 6: 1.1400001049041748 s
Iter 7: 1.0508427619934082 s
Iter 8: 1.0651192665100098 s
Iter 9: 1.0305383205413818 s

Even worse, reading only 5 seconds of the file still takes about 1 second.
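The three timing loops above share the same shape, so they could be folded into one small helper. This is only a sketch: time_calls and summarize are hypothetical names, and a stand-in workload is used so the snippet runs without lhotse or the test file.

```python
import statistics
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 10) -> List[float]:
    """Call fn() `repeats` times and return the wall-clock time of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> str:
    """Format the minimum, median, and maximum of a list of timings."""
    return (
        f"min {min(timings):.3f} s, "
        f"median {statistics.median(timings):.3f} s, "
        f"max {max(timings):.3f} s"
    )


# Stand-in workload so the sketch runs on its own; replace it with e.g.
#   lambda: recording.load_audio(offset=100, duration=5)
timings = time_calls(lambda: sum(range(100_000)), repeats=5)
print(summarize(timings))
```

time.perf_counter() is used instead of time.time() because it is a monotonic, high-resolution clock intended for measuring short intervals.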

If we use ffmpeg directly to read part of a file

for i in $(seq 10); do
  time ffmpeg -hide_banner -loglevel error -threads 1 -ss 200 -t 100 -i a.m4a -threads 1 100s.wav
  rm -f 100s.wav
done

The output is

real	0m0.172s
user	0m0.140s
sys	0m0.108s

real	0m0.171s
user	0m0.137s
sys	0m0.099s

real	0m0.175s
user	0m0.149s
sys	0m0.099s

real	0m0.169s
user	0m0.143s
sys	0m0.100s

real	0m0.173s
user	0m0.140s
sys	0m0.101s

real	0m0.200s
user	0m0.159s
sys	0m0.099s

real	0m0.168s
user	0m0.138s
sys	0m0.098s

real	0m0.172s
user	0m0.143s
sys	0m0.098s

real	0m0.163s
user	0m0.135s
sys	0m0.097s

real	0m0.172s
user	0m0.140s
sys	0m0.100s

You can see it is much faster: about 0.17 seconds to extract a 100-second segment, compared with roughly 1 second above.

This commit (csukuangfj@2dd8125) changes the current audio backend to use the ffmpeg command line, instead of torchaudio.load(), to handle m4a files.
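For readers who want to try the idea without the patch, here is a minimal sketch of reading a segment through the ffmpeg command line and converting its raw output to floats. It is not the code from the commit; read_segment and pcm_s16le_to_float are hypothetical names, and only the standard library is used so that the decoding step can be checked without ffmpeg installed.

```python
import array
import subprocess
import sys
from typing import List


def pcm_s16le_to_float(raw: bytes) -> List[float]:
    """Convert raw signed 16-bit little-endian PCM bytes to floats in [-1, 1)."""
    samples = array.array("h")  # "h" is a signed 16-bit integer
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # the stream is little-endian regardless of host
    return [s / 32768.0 for s in samples]


def read_segment(path: str, offset: float, duration: float) -> List[float]:
    """Decode one segment by piping ffmpeg's stdout (requires ffmpeg on PATH)."""
    cmd = [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-ss", str(offset), "-t", str(duration), "-i", path,
        "-ac", "1",  # downmix to mono so the output is not interleaved
        "-f", "s16le", "-",
    ]
    raw = subprocess.run(cmd, check=True, stdout=subprocess.PIPE).stdout
    return pcm_s16le_to_float(raw)
```

Assuming the a.m4a file from the reproduction steps exists, read_segment("a.m4a", offset=100, duration=5) would return the samples of that 5-second window. A real backend would more likely return a NumPy array and keep channels separate rather than downmixing.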

With the above commit, we have the following from the output of htop (screenshot not preserved).

The RTF on CPU for extracting features is about 0.0003-0.0008.
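RTF (real-time factor) is processing time divided by audio duration, so values far below 1 mean processing is much faster than real time. The sketch below shows the arithmetic; real_time_factor is a hypothetical helper, and the example inputs are approximate audio loading times from the measurements earlier in this thread, not the feature extraction timings behind the figure above.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Approximate loading times taken from the measurements in this thread.
print(real_time_factor(1.0, 5.0))     # default backend, 5-second segment
print(real_time_factor(0.17, 100.0))  # ffmpeg command line, 100-second segment
```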


pzelasko · 2025-03-05T15:01:17Z

Thanks for the detailed description. I managed to reproduce your issue. You can select a different existing audio backend that is more performant for m4a.

I got the best result with this:

with lhotse.audio_backend("FfmpegTorchaudioStreamerBackend"):
  for i in range(10):
    start = time.time()
    audio = recording.load_audio(offset=100, duration=5)
    end = time.time()
    print(f'Iter {i}: {end-start} s')

Output:

Iter 0: 0.014747142791748047 s
Iter 1: 0.011191129684448242 s
Iter 2: 0.009851455688476562 s
Iter 3: 0.00884866714477539 s
Iter 4: 0.008603811264038086 s
Iter 5: 0.008523941040039062 s
Iter 6: 0.008649110794067383 s
Iter 7: 0.009306192398071289 s
Iter 8: 0.009340286254882812 s
Iter 9: 0.008769750595092773 s

The second best was:

with lhotse.audio_backend("AudioreadBackend"):
  for i in range(10):
    start = time.time()
    audio = recording.load_audio(offset=100, duration=5)
    end = time.time()
    print(f'Iter {i}: {end-start} s')

Output:

Iter 0: 0.24323129653930664 s
Iter 1: 0.20929312705993652 s
Iter 2: 0.23014044761657715 s
Iter 3: 0.2314000129699707 s
Iter 4: 0.21734905242919922 s
Iter 5: 0.1870877742767334 s
Iter 6: 0.19929170608520508 s
Iter 7: 0.22307038307189941 s
Iter 8: 0.2071971893310547 s
Iter 9: 0.20948362350463867 s

It's also possible to globally override the audio backend with lhotse.set_audio_backend or with LHOTSE_AUDIO_BACKEND env var (see this readme section).

csukuangfj · 2025-03-06T01:48:31Z

Thank you for the suggestion! Yours is much better. Closing.

csukuangfj closed this as completed Mar 6, 2025
