Replies: 9 comments (from btofel, lissyx, and yv001; the reply bodies were not archived)
>>> btofel
[February 22, 2018, 11:06pm]
Am I missing something obvious? Clearly I am. Perhaps I'm making some
fundamental mistake in my understanding, but I expected the GPU version
to run an inference far faster than the CPU version. Instead it took
roughly 3 times as long as the CPU version did for the same inference.
Perhaps it's not really engaging the GPU? Is there a way to verify?
(One check is sketched after the log below.)
mail_reknewdeepdictation-1-gpu ~]$ deepspeech models/output_graph.pb data/recording2.wav models/alphabet.txt models/lm.binary models/trie
Loading model from file models/output_graph.pb
2018-02-22 23:01:22.906902: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-02-22 23:01:23.602419: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-02-22 23:01:23.602852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-02-22 23:01:23.602891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
Loaded model in 4.769s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 11.281s.
Running inference.
my mom this is bread i am speaking as clearly as possible and of slowly as possible i hope you get this
Inference took 17.003s for 10.000s audio file.
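For reference, that is a real-time factor of 17.003 s / 10.000 s ≈ 1.7, i.e. slower than real time.

A minimal way to check whether the GPU is actually being engaged, assuming the standard NVIDIA driver tooling (nvidia-smi) is installed: the log above shows TensorFlow creating /device:GPU:0, but creating the device alone doesn't prove the compute ops executed there; nonzero utilization while inference is running does.

# In a second terminal, refresh GPU stats every second while deepspeech runs.
# If the GPU is engaged, the deepspeech process should appear in the
# process list with memory allocated and nonzero GPU-Util:
$ watch -n 1 nvidia-smi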
[This is an archived TTS discussion thread from discourse.mozilla.org/t/gpu-much-slower]