Replies: 3 comments
>>> reuben |
>>> monakons |
>>> reuben |
>>> monakons
[May 20, 2018, 8:32pm]
I know that this is also a TensorFlow question, but TensorFlow has poor documentation on this.
This is what I understand from some external documentation.
In general, this layer is capable of:
(1) implementing a softmax layer to convert the output into a probability distribution over symbols;
(2) eliminating the repeated characters and the blank symbols that come from the acoustic model;
(3) implementing beam search over a prefix (character) tree and extracting the most probable sequence;
(4) optionally feeding this output into the language model, which is responsible for 'correcting' the output at the word level based on known word sequences.
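To make (1)-(3) concrete, here is a minimal sketch of the bare tf.nn.ctc_beam_search_decoder call, assuming TensorFlow 1.x; the shapes and the random logits are purely illustrative and this is not the DeepSpeech pipeline itself:

```python
# Minimal sketch (TensorFlow 1.x): decode raw acoustic-model logits with the
# stock CTC beam search op. Shapes and the random input are illustrative only.
import numpy as np
import tensorflow as tf

batch_size, max_time, num_classes = 1, 50, 29   # e.g. 28 labels + 1 blank

# Stand-in for the acoustic model's raw (pre-softmax) output, batch-major.
logits = tf.placeholder(tf.float32, [batch_size, max_time, num_classes])
seq_len = tf.placeholder(tf.int32, [batch_size])

# The op expects time-major input: [max_time, batch_size, num_classes].
logits_tm = tf.transpose(logits, [1, 0, 2])

# No language model is involved here; the op only searches over the logits.
decoded, log_probs = tf.nn.ctc_beam_search_decoder(
    logits_tm, seq_len, beam_width=100, top_paths=1, merge_repeated=True)

with tf.Session() as sess:
    fake = np.random.randn(batch_size, max_time, num_classes).astype(np.float32)
    best, scores = sess.run([decoded[0], log_probs],
                            feed_dict={logits: fake, seq_len: [max_time]})
    # best is a SparseTensorValue; best.values holds the label indices of the
    # most probable path, with blanks removed and repeats collapsed.
    print(best.values, scores)
```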
If the above are correct, my questions are:
(1) Where is the prefix tree that tf.nn.ctc_beam_search_decoder includes? Or does it not include one?
(2) Is the language model indeed used only at the word level? Is it possible to use it at the character level?
(3) I removed the tf.nn.ctc_beam_search_decoder from the protobuf file and implemented it later in the pipeline; however, the results using tf.nn.ctc_beam_search_decoder inside and outside the protobuf are different. Why is that happening? (See the sketch after this list.)
(4) Is the language model necessary while training? Isn't the purpose to run the LM on top of the acoustic model as a standalone module?
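Regarding (3), this is roughly the kind of setup meant by "outside the protobuf"; it is only a sketch, assuming TensorFlow 1.x, and the file name and the tensor names "logits:0" / "input_lengths:0" are illustrative assumptions rather than the actual names in the exported graph:

```python
# Sketch: load a frozen graph and attach the CTC decoder outside it
# (TensorFlow 1.x). The file and tensor names below are assumptions.
import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

    # Time-major logits and sequence lengths produced inside the frozen graph.
    logits = graph.get_tensor_by_name("logits:0")
    seq_len = graph.get_tensor_by_name("input_lengths:0")

    # Decoder re-created outside the protobuf. If the in-graph decoder was
    # built with different arguments (beam_width, top_paths, merge_repeated),
    # the two decodings can legitimately differ.
    decoded, log_probs = tf.nn.ctc_beam_search_decoder(
        logits, seq_len, beam_width=100, top_paths=1, merge_repeated=True)
```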
Thanks in advance!
[This is an archived TTS discussion thread from discourse.mozilla.org/t/how-exactly-the-decoder-and-especially-tf-nn-ctc-beam-search-decoder-works]