Replies: 13 comments
[13 replies between markmp and lissyx follow in the original thread; the reply bodies were not preserved in this archive.]
>>> markmp
[March 7, 2018, 5:37am]
Hi,
After much wrangling, I was able to get DeepSpeech working on the new
Amazon V100 NVIDIA instances (p3.2xlarge). Inference seems quite slow
though: I'm getting 0.95x to 1.25x real-time on the V100, i.e. about
2 seconds of inference for a 2-second audio clip. This is a
top-of-the-line card, and that is much slower than what others are
reporting (closer to 0.3x to 0.4x).
For comparison, the CPU takes 2x to 2.5x real-time for the same
inference. I'm really surprised the V100 isn't performing better, and
I'm wondering if I'm doing something suboptimal.
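[A minimal sketch of how a real-time factor like the one above can be measured, assuming the deepspeech-gpu 0.1.x Python API (Model from deepspeech.model); the model, alphabet, and WAV paths below are placeholders. The first call is discarded as warm-up, since it includes one-off CUDA and graph initialization that would otherwise inflate the number.]

```python
# Sketch: measuring the real-time factor of DeepSpeech inference.
# Assumes the deepspeech-gpu 0.1.x Python API and a 16 kHz mono 16-bit WAV;
# all file paths are placeholders.
import time
import scipy.io.wavfile as wav
from deepspeech.model import Model

N_FEATURES = 26   # MFCC features per time step (0.1.x client defaults)
N_CONTEXT = 9     # context frames on each side of the current frame
BEAM_WIDTH = 500

ds = Model('models/output_graph.pb', N_FEATURES, N_CONTEXT,
           'models/alphabet.txt', BEAM_WIDTH)

fs, audio = wav.read('clip_2s.wav')
audio_len = len(audio) / fs

ds.stt(audio, fs)  # warm-up run: absorbs CUDA context / graph setup cost

start = time.time()
text = ds.stt(audio, fs)
elapsed = time.time() - start

print(text)
print('%.2fs inference for %.2fs of audio -> %.2fx real-time'
      % (elapsed, audio_len, elapsed / audio_len))
```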
Getting CPU inference going was fine, but GPU inference was very
frustrating because of version mismatches between TensorFlow,
DeepSpeech, CUDA, and cuDNN. In the end, the only config I could get
running was:
- pip install 'tensorflow-gpu==1.5.0'
- pip install deepspeech-gpu (the PyPI package doesn't work, so I used the artifact here: https://tools.taskcluster.net/index/project.deepspeech.deepspeech.native_client.master/gpu)
- manually install CUDA 9.0
- manually install cuDNN 7.0.5
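[One thing worth ruling out with a stack like this is TensorFlow silently falling back to the CPU, e.g. when the CUDA 9.0 / cuDNN 7.0.5 libraries are not found at runtime; timings would then resemble the CPU numbers above. A quick check, assuming tensorflow-gpu 1.5.0:]

```python
# Verify that tensorflow-gpu actually registers the V100. If only a CPU
# device is listed, the CUDA/cuDNN install is not being picked up and
# inference is silently running on the CPU.
from tensorflow.python.client import device_lib

for d in device_lib.list_local_devices():
    print(d.name, d.device_type)  # expect a '/device:GPU:0' entry for the V100
```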
Is anyone getting faster performance on the Amazon V100? Or is 0.95x
the best I can hope for?
thx
[This is an archived TTS discussion thread from discourse.mozilla.org/t/inference-time-on-v100-seems-slow]