BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance, Schick and Schütze, 2019
Pretrained deep learning models have trouble understanding rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this paper we transfer this idea to pretrained language models.
We build on the form-context model (FCM) and attentive mimicking (AM). Attentive mimicking processes contexts as a bag of words (BoW), which is reasonable for static embeddings but a poor fit for deep language models. The context part of FCM can capture a rare word's broad topic, but because it is a BoW model it cannot obtain a more concrete or detailed understanding. Moreover, form and context are combined only in a shallow way.
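For concreteness, here is a minimal PyTorch sketch of what such a shallow combination looks like: a single learned gate mixes a surface-form vector with a BoW context vector. The layer names and dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ShallowFormContextGate(nn.Module):
    """FCM-style shallow combination: one learned gate mixes the
    surface-form embedding with the BoW context embedding.
    Names and dimensions are illustrative assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate sees both vectors and outputs a scalar weight alpha.
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, v_form: torch.Tensor, v_context: torch.Tensor) -> torch.Tensor:
        # v_form:    (batch, dim) -- embedding inferred from character n-grams
        # v_context: (batch, dim) -- average of BoW context word embeddings
        alpha = torch.sigmoid(self.gate(torch.cat([v_form, v_context], dim=-1)))
        # Convex combination of the two vectors: this single mixing step is
        # the "shallow" interaction that the paper argues is insufficient.
        return alpha * v_form + (1 - alpha) * v_context
```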
We introduce BERTRAM, an architecture based on BERT that is capable of inferring high-quality embeddings for rare words by combining a pretrained BERT model with AM. The key idea is to let a word's surface form and its contexts interact with each other in a deep architecture.
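A rough sketch of that deep interaction, assuming a HuggingFace-style BERT: the surface-form embedding (stubbed here with a random vector in place of a character-n-gram encoder) is injected into BERT's input at the rare word's slot, so form and context can attend to each other in every transformer layer. This illustrates the interaction only; it is not BERTRAM's actual training setup.

```python
import torch
from transformers import BertModel, BertTokenizer

# Hypothetical stand-in for a character-n-gram form encoder.
def form_embedding(word: str, dim: int = 768) -> torch.Tensor:
    return torch.randn(dim)  # placeholder vector, assumption for illustration

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

# A context sentence with a [MASK] slot standing in for the rare word.
enc = tokenizer("the [MASK] was delicious", return_tensors="pt")

with torch.no_grad():
    embeds = model.get_input_embeddings()(enc["input_ids"]).clone()  # (1, seq, 768)
    # Inject the surface-form embedding at the rare word's slot so that form
    # and context interact through all self-attention layers (deep combination).
    slot = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    embeds[0, slot] = form_embedding("kumquat")
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])

# The output at the injected position serves as the inferred rare-word embedding.
rare_word_vector = out.last_hidden_state[0, slot]
```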
For evaluation, rare words are not well represented in commonly used downstream task datasets, so we introduce rarification, a procedure that automatically converts evaluation datasets into ones in which rare words are guaranteed to be important: task-relevant frequent words are replaced with rare synonyms.
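A toy sketch of that substitution step, assuming WordNet as the synonym source and a simple corpus-count threshold as the notion of "rare"; the paper's full procedure is more involved, and the names and thresholds below are illustrative assumptions.

```python
from collections import Counter
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def rare_synonyms(word: str, counts: Counter, max_count: int = 10) -> list[str]:
    """WordNet synonyms of `word` seen at most `max_count` times in the corpus."""
    candidates = {
        lemma.name().replace("_", " ")
        for synset in wn.synsets(word)
        for lemma in synset.lemmas()
    }
    candidates.discard(word)
    return sorted(w for w in candidates if counts[w] <= max_count)

def rarify(sentence: str, counts: Counter,
           min_count: int = 1000, max_count: int = 10) -> str:
    """Replace frequent words (count >= min_count) that have a rare synonym."""
    out = []
    for token in sentence.split():
        replacements = []
        if counts[token.lower()] >= min_count:  # only target frequent words
            replacements = rare_synonyms(token.lower(), counts, max_count)
        out.append(replacements[0] if replacements else token)
    return " ".join(out)

# Example with made-up corpus counts.
corpus_counts = Counter({"movie": 50_000, "film": 40_000, "flick": 4})
print(rarify("a wonderful movie", corpus_counts))
# -> e.g. "a wonderful flick" (exact choice depends on the counts and WordNet)
```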
Our analysis shows that BERTRAM is beneficial not only for rare words, our main target in this paper, but also for frequent words.