ADD soundchoice-g2p #59

kyakuno · 2024-06-23T01:04:32Z

Encoder
Decoder with Beam Search
Implement soundchoice-g2p ailia-models#1500

kyakuno · 2024-06-24T01:16:37Z

BERT Encoder

input_text To be or not to be, that is the question
txt_cleaned TO BE OR NOT TO BE THAT IS THE QUESTION
grapheme_encoded [ 0 22 17 30  4  7 30 17 20 30 16 17 22 30 22 17 30  4  7 30 22 10  3 22
 30 11 21 30 22 10  7 30 19 23  7 21 22 11 17 16]
input_ids [[ 101 2000 2022 2030 2025 2000 2022 1010 2008 2003 1996 3160  102]]
attention_mask [[1 1 1 1 1 1 1 1 1 1 1 1 1]]
token_type_ids [[0 0 0 0 0 0 0 0 0 0 0 0 0]]
hidden_states [[[[ 0.1685791  -0.28588867 -0.32592773 ... -0.02757263  0.03826904
     0.16394043]
hidden_states (13, 1, 13, 768)
word_emb [[ 5.42041016  2.01513672 -2.68774414 ...  0.62231445  0.51385498
   1.50268555]
 [ 5.42041016  2.01513672 -2.68774414 ...  0.62231445  0.51385498
   1.50268555]
output.shape (11, 768)

clean_pipeline converts runs of two or more spaces into a single space and uppercases the text.
grapheme_encoded is produced with the table below.

lab2ind = {
    # fmt: off
    '<bos>': 0, '<eos>': 1, '<unk>': 2, 'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 7, 'F': 8, 'G': 9, 'H': 10, 'I': 11, 'J': 12, 'K': 13, 'L': 14, 'M': 15, 'N': 16, 'O': 17, 'P': 18, 'Q': 19, 'R': 20, 'S': 21, 'T': 22, 'U': 23, 'V': 24, 'W': 25, 'X': 26, 'Y': 27, 'Z': 28, "'": 29, ' ': 30
    # fmt: on
}
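The cleaning and encoding steps can be sketched roughly as follows. This is a minimal sketch, not the actual SpeechBrain pipeline: `clean_text` and `encode_graphemes` are hypothetical names, and dropping every character outside the table (which is what removes the comma) is an assumption inferred from the txt_cleaned dump above.

```python
import re

# Same mapping as the lab2ind table above, built programmatically.
GRAPHEMES = ["<bos>", "<eos>", "<unk>"] + list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + ["'", " "]
lab2ind = {g: i for i, g in enumerate(GRAPHEMES)}


def clean_text(text):
    """Uppercase, drop characters outside the table (assumption), collapse spaces."""
    text = text.upper()
    text = re.sub(r"[^A-Z' ]", "", text)
    text = re.sub(r" {2,}", " ", text)
    return text.strip()


def encode_graphemes(cleaned):
    """Prepend <bos> and map each character to its index."""
    return [lab2ind["<bos>"]] + [lab2ind[c] for c in cleaned]


cleaned = clean_text("To be or not to be, that is the question")
encoded = encode_graphemes(cleaned)
print(cleaned)
print(encoded)
```

With the sample sentence this reproduces the txt_cleaned and grapheme_encoded values in the dump above.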

kyakuno · 2024-06-24T01:25:47Z

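The shape of hidden_states is (13, 1, 13, 768).
The second 13 is the number of tokens. The first 13 is the layer axis (it is the axis indexed by layers in the code below; for a BERT-base model this would be the embedding output plus 12 encoder layers).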

word_ids [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None]
token_ids_word [ 1  2  3  4  5  6  7  8  9 10 11]

The code below puzzles me.

    # get_hidden_states
    layers = [-4, -3, -2, -1]
    output = np.sum(hidden_states[layers], axis=0)
    output = np.squeeze(output)
    output = output[token_ids_word]

In the original code, the structure references the last four entries of hidden_states.
https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/wordemb/transformer.py#L247-L253

kyakuno · 2024-06-24T02:44:48Z

Logically, it sums the last four hidden states and then takes only the embeddings, excluding the special tokens.
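A small self-contained sketch of that logic, using random data with the shapes from the dump. The helper name `sum_last_layers` is hypothetical, and deriving the kept rows from `word_ids` (rather than a precomputed token_ids_word) is just one way to express "drop the special tokens".

```python
import numpy as np

rng = np.random.default_rng(0)
# (layers, batch, tokens, hidden) as in the dump; values are random placeholders.
hidden_states = rng.standard_normal((13, 1, 13, 768)).astype(np.float32)
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None]


def sum_last_layers(hidden_states, layers=(-4, -3, -2, -1)):
    """Sum the selected layers and drop the batch axis."""
    summed = np.sum(hidden_states[list(layers)], axis=0)  # (1, tokens, hidden)
    return np.squeeze(summed, axis=0)  # (tokens, hidden)


token_states = sum_last_layers(hidden_states)
# Special tokens ([CLS], [SEP]) have word id None; keep everything else.
keep = np.array([i for i, w in enumerate(word_ids) if w is not None])
word_emb = token_states[keep]
print(word_emb.shape)  # (11, 768)
```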

kyakuno · 2024-06-24T04:25:11Z

expand_to_chars detects word_boundaries in the input character sequence at the token that corresponds to a space, and converts the word-level embeddings into character-level embeddings.

seq [[ 0 22 17 30  4  7 30 17 20 30 16 17 22 30 22 17 30  4  7 30 22 10  3 22
  30 11 21 30 22 10  7 30 19 23  7 21 22 11 17 16]]
word_boundaries [[False False False  True False False  True False False  True False False
  False  True False False  True False False  True False False False False
   True False False  True False False False  True False False False False
  False False False False]]
emb.shape (1, 11, 768)
words.shape (1, 40)
char_word_emb.shape (1, 40, 768)
char_word_emb [[[ 5.42041016  2.01513672 -2.68774414 ...  0.62231445  0.51385498
    1.50268555]
  [ 5.42041016  2.01513672 -2.68774414 ...  0.62231445  0.51385498
    1.50268555]
  [ 5.42041016  2.01513672 -2.68774414 ...  0.62231445  0.51385498
    1.50268555]
  ...

kyakuno · 2024-06-24T04:32:32Z

With the input
To be or not to be, that is the questionary
word_ids contains duplicates, because questionary is split by WordPiece.

In this case, token_ids_word becomes the following.
tokens [101, 2000, 2022, 2030, 2025, 2000, 2022, 1010, 2008, 2003, 1996, 3160, 5649, 102]
word_ids [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, None]
token_ids_word [ 1 2 3 4 5 6 7 8 9 10 11 12]

emb.shape (1, 12, 768)
so the embeddings are stored per subword.

If expand_to_chars is applied to this data, I suspect the word counts will no longer match.

kyakuno · 2024-06-24T04:36:41Z

The original code is below. I suspect the definition of token_ids_word went wrong when it was ported to numpy.
https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/wordemb/transformer.py

    def _get_word_vector(self, encoded, states, idx):
        token_ids_word = torch.from_numpy(
            np.where(np.array(encoded.word_ids()) == idx)[0]
        ).to(self.device)
        return self._get_hidden_states(states, token_ids_word)
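For comparison, here is a numpy sketch that selects the token rows belonging to each word with the same `word_ids == idx` test and reduces them to one vector per word. The function name `pool_by_word` is hypothetical, and mean pooling is an assumption made for illustration; taking only the first subword of each word would be another option.

```python
import numpy as np


def pool_by_word(token_states, word_ids):
    """Return one vector per word by averaging the states of its subword tokens."""
    # Map None (special tokens) to -1 so the comparison below never selects them.
    ids = np.array([-1 if w is None else w for w in word_ids])
    n_words = int(ids.max()) + 1
    return np.stack([token_states[ids == w].mean(axis=0) for w in range(n_words)])


# word_ids for the "questionary" example: word 10 spans two subword tokens.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, None]
token_states = np.arange(14 * 4, dtype=np.float64).reshape(14, 4)  # toy states
word_emb = pool_by_word(token_states, word_ids)
print(word_emb.shape)  # (11, 4)
```

With this grouping the result has one row per word (11) rather than one row per subword (12).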

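kyakuno · 2024-06-24T08:56:03Z

When a "," is present, as in "To be or not to be, that is the questionary", grapheme_pipeline removes the ",", but the BERT tokenizer treats "," as a word. As a result, the number of entries in token_ids_word also no longer matches the number of segments obtained by splitting seq inside expand_to_chars. This may be another problem introduced by the port.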

kyakuno · 2024-06-24T09:06:42Z

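For token_ids_word, not incrementing the count for continuation subwords seems to be correct (already reflected in C++; fix Python as well?)
Because of the ",", the word count of seq and the word count of the tokenizer do not match (does the Python version have this problem too?)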
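The mismatch can be checked directly from the dumps earlier in this thread. The sketch below only counts words on each side; the variable names are chosen here for illustration.

```python
import numpy as np

WORD_SEPARATOR = 30  # index of ' ' in lab2ind

# grapheme_encoded for "To be or not to be, that is the question" (comma removed)
seq = np.array([0, 22, 17, 30, 4, 7, 30, 17, 20, 30, 16, 17, 22, 30, 22, 17,
                30, 4, 7, 30, 22, 10, 3, 22, 30, 11, 21, 30, 22, 10, 7, 30,
                19, 23, 7, 21, 22, 11, 17, 16])
# word_ids from the BERT tokenizer for the same text (comma kept as its own word)
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, None]

n_seq_words = int(np.sum(seq == WORD_SEPARATOR)) + 1
n_tok_words = len({w for w in word_ids if w is not None})
print(n_seq_words, n_tok_words)  # 10 11
```

The character side sees 10 words while the tokenizer side produces 11, so every word after the comma is paired with the wrong embedding.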

kyakuno · 2024-06-24T12:25:04Z

Dump of the values in expand_to_chars. Each space position receives the embedding of the word that follows it.

        char_word_emb[idx] = emb[idx, item]

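However, I do not understand what the logic below is doing.
item_length holds 43*43 = 1849. There should only be values up to :43 in the first place, yet a much larger value is used as the slice start.
word_boundaries is also an array of True/False, and it is used as the index of the assignment.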

        char_word_emb[idx, item_length:, :] = 0
        char_word_emb[idx, word_boundaries[idx], :] = 0

The original code is below.
https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/wordemb/util.py

seq [[ 0 22 17 30  4  7 30 17 20 30 16 17 22 30 22 17 30  4  7 30 22 10  3 22
  30 11 21 30 22 10  7 30 19 23  7 21 22 11 17 16  3 20 27]]
emb.shape (1, 11, 768)
word_boundaries [[False False False  True False False  True False False  True False False
  False  True False False  True False False  True False False False False
   True False False  True False False False  True False False False False
  False False False False False False False]]
words.shape (1, 43)
words [[0 0 0 1 1 1 2 2 2 3 3 3 3 4 4 4 5 5 5 6 6 6 6 6 7 7 7 8 8 8 8 9 9 9 9 9
  9 9 9 9 9 9 9]]
char_word_emb.shape (1, 43, 768)
seq_len [43]
seq.shape[-1] 43
seq_len_idx [1849]
idx 0
item [0 0 0 1 1 1 2 2 2 3 3 3 3 4 4 4 5 5 5 6 6 6 6 6 7 7 7 8 8 8 8 9 9 9 9 9 9
 9 9 9 9 9 9]
item_length 1849
word_boundaries[idx] [False False False  True False False  True False False  True False False

    print("seq", seq)
    print("emb.shape", emb.shape)

    word_boundaries = seq == word_separator
    words = np.cumsum(word_boundaries, axis=-1)

    print("word_boundaries", word_boundaries)
    print("words.shape", words.shape)
    print("words", words)

    char_word_emb = np.zeros((emb.shape[0], seq.shape[-1], emb.shape[-1]))

    print("char_word_emb.shape", char_word_emb.shape)

    seq_len_idx = (seq_len * seq.shape[-1]).astype(int)
    print("seq_len", seq_len)
    print("seq.shape[-1]", seq.shape[-1])
    print("seq_len_idx", seq_len_idx)
    for idx, (item, item_length) in enumerate(zip(words, seq_len_idx)):
        print("idx", idx)
        print("item", item)
        print("item_length", item_length)
        print("word_boundaries[idx]", word_boundaries[idx])
        char_word_emb[idx] = emb[idx, item]
        char_word_emb[idx, item_length:, :] = 0
        char_word_emb[idx, word_boundaries[idx], :] = 0

    print("char_word_emb", char_word_emb)

kyakuno · 2024-06-24T12:35:33Z

Indexing with word_boundaries[idx] in the assignment is intended to put 0 in the space positions.
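A toy run makes both zeroing steps visible. This sketch keeps the logic of the dumped code without the prints, and assumes that seq_len holds relative lengths in [0, 1] (SpeechBrain's usual convention for batch lengths). Under that assumption item_length is the padded-tail start; passing an absolute length such as 43 would instead give 43 * 43 = 1849 and make the slice a no-op, as in the dump above. The batch contents are made up for illustration.

```python
import numpy as np


def expand_to_chars(emb, seq, seq_len, word_separator):
    """Copy each word embedding to every character position of that word."""
    word_boundaries = seq == word_separator
    words = np.cumsum(word_boundaries, axis=-1)  # word index per character
    char_word_emb = np.zeros((emb.shape[0], seq.shape[-1], emb.shape[-1]), dtype=emb.dtype)
    seq_len_idx = (seq_len * seq.shape[-1]).astype(int)  # relative -> absolute
    for idx, (item, item_length) in enumerate(zip(words, seq_len_idx)):
        char_word_emb[idx] = emb[idx, item]
        char_word_emb[idx, item_length:, :] = 0  # zero the padded tail
        char_word_emb[idx, word_boundaries[idx], :] = 0  # zero the spaces
    return char_word_emb


# Item 0: "<bos> T O _ B E" (full length); item 1: "<bos> I S" padded with 0.
seq = np.array([[0, 22, 17, 30, 4, 7],
                [0, 11, 21, 0, 0, 0]])
seq_len = np.array([1.0, 0.5])
emb = np.arange(2 * 2 * 3, dtype=np.float64).reshape(2, 2, 3) + 1
out = expand_to_chars(emb, seq, seq_len, word_separator=30)
print(out[0])
print(out[1])
```

In item 0 the space at position 3 is zeroed and positions 4 and 5 receive the second word's embedding; in item 1 everything from position 3 onward is zeroed as padding.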

kyakuno · 2024-06-24T12:46:20Z

attention

p_seq.shape (1, 1, 43)
p_seq [[[-5.957753   -6.236804   -5.8820686  -5.653797   -6.1083875
   -5.641102   -5.7392464  -6.0706687  -5.921131   -5.926258
   -5.885243   -5.5038943  -5.98119    -5.712879   -6.114613
   -6.3263593  -5.7158093  -5.697743   -5.784657   -5.2665906
   -6.007314   -5.9040413  -5.0927625  -5.5248904  -5.73534
   -5.4277234  -6.118397   -6.0178127  -5.7719617  -4.4199104
   -5.14159    -4.757801   -5.825917   -0.16014495 -5.8332405
   -5.837391   -5.4638557  -5.6127815  -5.728504   -5.7895393
   -5.420399   -5.895008   -6.436634  ]]]
encoder_outputs.shape (1, 43, 1024)
encoder_outputs [[[-5.9692383e-02 -2.2064209e-02 -3.2318115e-02 ... -2.9687500e-01
   -2.5768280e-03 -6.8740845e-03]
  [ 1.4819336e-01 -6.0363770e-02 -7.7197266e-01 ...  4.4647217e-02
    0.0000000e+00 -3.1530857e-05]
  [ 7.2998047e-02 -7.1960449e-02  1.3769531e-01 ... -3.6694336e-01
    1.1238098e-02 -8.0943108e-05]
  ...
  [ 8.6669922e-02  5.4382324e-02 -5.2880859e-01 ... -5.2880859e-01
    1.8322468e-04 -1.3113022e-06]
  [ 5.7324219e-01  4.9609375e-01  3.4887695e-01 ... -1.0604858e-02
    3.8290024e-04 -2.7418137e-04]
  [ 7.0166016e-01  3.2788086e-01  5.5615234e-01 ... -4.1015625e-01
    1.4219284e-03  3.2782555e-06]]]

kyakuno · 2024-06-25T00:07:48Z

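Test with the original SpeechBrain.
As expected, tokens that carry the continuation symbol are skipped.
However, it still does not seem to handle the comma properly.
The text has 11 symbols including the ",", yet it is copied to the characters as if it had 10 symbols.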

text = "To be or not to be, that is the question"

emb.shape torch.Size([1, 11, 768])
seq tensor([[ 0, 22, 17, 30,  4,  7, 30, 17, 20, 30, 16, 17, 22, 30, 22, 17, 30,  4,
          7, 30, 22, 10,  3, 22, 30, 11, 21, 30, 22, 10,  7, 30, 19, 23,  7, 21,
         22, 11, 17, 16]])
seq.shape torch.Size([1, 40])
words tensor([[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6,
         7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9]])
words.shape torch.Size([1, 40])

text = "To be or not to be, that is the questionary"

emb.shape torch.Size([1, 12, 768])
seq tensor([[ 0, 22, 17, 30,  4,  7, 30, 17, 20, 30, 16, 17, 22, 30, 22, 17, 30,  4,
          7, 30, 22, 10,  3, 22, 30, 11, 21, 30, 22, 10,  7, 30, 19, 23,  7, 21,
         22, 11, 17, 16,  3, 20, 27]])
seq.shape torch.Size([1, 43])
words tensor([[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 6,
         7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9]])

kyakuno · 2024-06-25T00:14:36Z

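The current system cannot handle punctuation, so it looks like punctuation should be replaced beforehand.
speechbrain/speechbrain#2227
It looks like the official soundchoice sample is simply being fed a "," that must not be passed in.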

kyakuno · 2024-06-25T00:20:32Z

Filed an issue for this.
speechbrain/speechbrain#2580

kyakuno · 2024-06-25T00:29:22Z

Entry-point code for inference.
https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/inference/text.py

from speechbrain.inference.text import GraphemeToPhoneme
g2p = GraphemeToPhoneme.from_hparams("speechbrain/soundchoice-g2p", savedir="pretrained_models/soundchoice-g2p")
text = "To be or not to be, that is the question"
phonemes = g2p(text)

kyakuno · 2024-06-25T00:39:07Z

Even with the original code, emb.shape[1] becomes 12. It seems the original code itself does not properly account for subword splitting in the first place.

To be or not to be, that is the questionary
emb.shape torch.Size([1, 12, 768])

kyakuno added the high priotity label Jun 23, 2024

kyakuno mentioned this issue Jun 24, 2024

[WIP] Implement soundchoice g2p #62

Open

kyakuno self-assigned this Jun 25, 2024
