This repository has been archived by the owner on Oct 18, 2021. It is now read-only.

Unknown words in the input: Predicting from the context #3

Open
bartvm opened this issue Dec 2, 2015 · 1 comment

bartvm (Owner) commented Dec 2, 2015

One way of addressing unknown words in the input would be to predict the missing embedding from the context. This is effectively the same as saying that, for these words, we simply run a language model and use its prediction in place of the missing embedding. Questions to look into:

  • What kind of language model? A simple language model, an RNN language model, or a bidirectional RNN language model?
  • Whether or not we should pre-train this language model on monolingual data (it might spend a lot of capacity on learning words that are already in the vocabulary, which would be wasted capacity)
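A minimal sketch of the bidirectional-RNN variant (written in PyTorch; the module, its names, and the dimensions are my own assumptions, not code from this repository): run a bidirectional RNN over the sentence, take its hidden state at the UNK position, and project it back to embedding size, so the encoder receives a context-predicted vector instead of the generic UNK embedding.

```python
# Hypothetical sketch (PyTorch, not this repository's code): a bidirectional
# RNN "language model" over the context that predicts a replacement embedding
# for an UNK token, to be fed to the encoder instead of the generic UNK vector.
import torch
import torch.nn as nn

class ContextualUnkEmbedder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional GRU reads the words to the left and right of the UNK.
        self.rnn = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Project the concatenated forward/backward states back to embedding size.
        self.project = nn.Linear(2 * hidden_dim, emb_dim)

    def forward(self, context_ids, unk_positions):
        # context_ids: (batch, seq_len) token ids, with unknown words mapped to UNK
        # unk_positions: (batch,) position of the unknown word in each sequence
        states, _ = self.rnn(self.embed(context_ids))      # (batch, seq_len, 2*hidden_dim)
        unk_states = states[torch.arange(context_ids.size(0)), unk_positions]
        return self.project(unk_states)                     # (batch, emb_dim)

# Example: predict embeddings for the UNKs at the given positions.
model = ContextualUnkEmbedder(vocab_size=1000, emb_dim=128, hidden_dim=256)
ids = torch.randint(0, 1000, (4, 20))
predicted = model(ids, torch.tensor([5, 7, 3, 12]))         # shape (4, 128)
```

Pre-training (the second question above) could then amount to fitting this module on monolingual data, with in-vocabulary words randomly masked out as UNK, before plugging it into the translation model.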

Reasons I think this could help:

  • Compared to getting embeddings from the character level, this could give the encoder more sensible embeddings for e.g. proper nouns
  • Even if it only half-works, it's still better than feeding an UNK embedding, which could mean anything to the encoder (proper noun, rare word, typo, gibberish, etc.)
anirudh9119 (Collaborator) commented

I did a quick experiment to see whether the word embeddings from the context were actually helping or not (which also serves as a check that my implementation is correct).

I ran training with and without the word embeddings from the context to see how much it helps, using vocabulary sizes of 100, 500, 1000, and 2000, and let the code run for 6 hours (both with and without contextual word embeddings). The validation error was lower with the contextual word embeddings when trained on Europarl only (the difference was approximately 3%).

Interestingly, the difference between the validation error with and without contextual word embeddings was larger with dictionary size 1000 than with 100 or 500, as expected.

So, can I conclude from this experiment that the word embeddings from the context are actually helping?

JinseokNam reopened this on Feb 3, 2016