diff --git a/chapter_natural-language-processing-pretraining/bert-dataset.md b/chapter_natural-language-processing-pretraining/bert-dataset.md new file mode 100644 index 000000000..002418c42 --- /dev/null +++ b/chapter_natural-language-processing-pretraining/bert-dataset.md @@ -0,0 +1,359 @@ +# 预训练数据集 BERT +:label:`sec_bert-dataset` + +要预训练在 :numref:`sec_bert` 中实施的 BERT 模型,我们需要以理想的格式生成数据集,以便于完成两项预训练任务:蒙版语言建模和下一句预测。一方面,原始的 BERT 模型是在两个巨大的语库 Bookcorpus 和英语维基百科(见 :numref:`subsec_bert_pretraining_tasks`)的连接方面进行了预训练,这使得这本书的大多数读者难以运行。另一方面,现成的预训练 BERT 模型可能不适合来自医学等特定领域的应用。因此,在自定义数据集上对 BERT 进行预训练越来越受欢迎。为了促进 BERT 预训练的演示,我们使用了较小的语料库 Wikitext-2 :cite:`Merity.Xiong.Bradbury.ea.2016`。 + +与 :numref:`sec_word2vec_data` 中用于预训 word2vec 的 PTB 数据集相比,WikiText-2 (i) 保留了原始标点符号,使其适合于下一句预测;(ii) 保留原始大小写和数字;(iii) 大两倍以上。 + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import gluon, np, npx +import os +import random + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import os +import random +import torch +``` + +在 Wikitext-2 数据集中,每行代表一个段落,其中任何标点符号和前面的标记之间插入空格。保留至少有两句话的段落。为了分割句子,为了简单起见,我们只使用句点作为分隔符。我们将在本节末尾的练习中讨论更复杂的句子分割技术。 + +```{.python .input} +#@tab all +#@save +d2l.DATA_HUB['wikitext-2'] = ( + 'https://s3.amazonaws.com/research.metamind.io/wikitext/' + 'wikitext-2-v1.zip', '3c914d17d80b1459be871a5039ac23e752a53cbe') + +#@save +def _read_wiki(data_dir): + file_name = os.path.join(data_dir, 'wiki.train.tokens') + with open(file_name, 'r') as f: + lines = f.readlines() + # Uppercase letters are converted to lowercase ones + paragraphs = [line.strip().lower().split(' . ') + for line in lines if len(line.split(' . ')) >= 2] + random.shuffle(paragraphs) + return paragraphs +``` + +## 为预训任务定义助手函数 + +在下面,我们首先为两个 BERT 预训练任务实施辅助函数:下一句预测和蒙版语言建模。稍后将原始文本语料库转换为预训练 BERT 的理想格式的数据集时,将调用这些辅助函数。 + +### 生成下一句预测任务 + +根据 :numref:`subsec_nsp` 的描述,`_get_next_sentence` 函数生成了二进制分类任务的训练示例。 + +```{.python .input} +#@tab all +#@save +def _get_next_sentence(sentence, next_sentence, paragraphs): + if random.random() < 0.5: + is_next = True + else: + # `paragraphs` is a list of lists of lists + next_sentence = random.choice(random.choice(paragraphs)) + is_next = False + return sentence, next_sentence, is_next +``` + +以下函数通过调用 `_get_next_sentence` 函数生成从输入 `paragraph` 进行下一句预测的训练示例。这里 `paragraph` 是一个句子列表,其中每句都是一个令牌列表。参数 `max_len` 指定了预训期间 BERT 输入序列的最大长度。 + +```{.python .input} +#@tab all +#@save +def _get_nsp_data_from_paragraph(paragraph, paragraphs, vocab, max_len): + nsp_data_from_paragraph = [] + for i in range(len(paragraph) - 1): + tokens_a, tokens_b, is_next = _get_next_sentence( + paragraph[i], paragraph[i + 1], paragraphs) + # Consider 1 '' token and 2 '' tokens + if len(tokens_a) + len(tokens_b) + 3 > max_len: + continue + tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b) + nsp_data_from_paragraph.append((tokens, segments, is_next)) + return nsp_data_from_paragraph +``` + +### 生成蒙版语言建模任务 +:label:`subsec_prepare_mlm_data` + +为了根据 BERT 输入序列生成蒙版语言建模任务的训练示例,我们定义了以下 `_replace_mlm_tokens` 函数。在其输入中,`tokens` 是代表 BERT 输入序列的令牌列表,`candidate_pred_positions` 是 BERT 输入序列的令牌索引列表,不包括特殊令牌的索引(在蒙屏语言建模任务中没有预测特殊令牌),`num_mlm_preds` 表示预测数量(回想 15%随机令牌可以预测)。在 :numref:`subsec_mlm` 中对蒙版语言建模任务的定义之后,在每个预测位置,输入可以被特殊的 “” 令牌或随机令牌替换,或者保持不变。最后,该函数在可能的替换后返回输入令牌、发生预测的令牌索引以及这些预测的标签。 + +```{.python .input} +#@tab all +#@save +def _replace_mlm_tokens(tokens, candidate_pred_positions, num_mlm_preds, + vocab): + # Make a new copy of tokens for the input of a masked language model, + # where the input may 
contain replaced '' or random tokens + mlm_input_tokens = [token for token in tokens] + pred_positions_and_labels = [] + # Shuffle for getting 15% random tokens for prediction in the masked + # language modeling task + random.shuffle(candidate_pred_positions) + for mlm_pred_position in candidate_pred_positions: + if len(pred_positions_and_labels) >= num_mlm_preds: + break + masked_token = None + # 80% of the time: replace the word with the '' token + if random.random() < 0.8: + masked_token = '' + else: + # 10% of the time: keep the word unchanged + if random.random() < 0.5: + masked_token = tokens[mlm_pred_position] + # 10% of the time: replace the word with a random word + else: + masked_token = random.randint(0, len(vocab) - 1) + mlm_input_tokens[mlm_pred_position] = masked_token + pred_positions_and_labels.append( + (mlm_pred_position, tokens[mlm_pred_position])) + return mlm_input_tokens, pred_positions_and_labels +``` + +通过调用上述 `_replace_mlm_tokens` 函数,以下函数将 BERT 输入序列 (`tokens`) 作为输入序列 (`tokens`) 并返回输入令牌的索引(在可能的令牌替换后,如 :numref:`subsec_mlm` 所述)、发生预测的令牌指数以及标记这些指数的索引预测。 + +```{.python .input} +#@tab all +#@save +def _get_mlm_data_from_tokens(tokens, vocab): + candidate_pred_positions = [] + # `tokens` is a list of strings + for i, token in enumerate(tokens): + # Special tokens are not predicted in the masked language modeling + # task + if token in ['', '']: + continue + candidate_pred_positions.append(i) + # 15% of random tokens are predicted in the masked language modeling task + num_mlm_preds = max(1, round(len(tokens) * 0.15)) + mlm_input_tokens, pred_positions_and_labels = _replace_mlm_tokens( + tokens, candidate_pred_positions, num_mlm_preds, vocab) + pred_positions_and_labels = sorted(pred_positions_and_labels, + key=lambda x: x[0]) + pred_positions = [v[0] for v in pred_positions_and_labels] + mlm_pred_labels = [v[1] for v in pred_positions_and_labels] + return vocab[mlm_input_tokens], pred_positions, vocab[mlm_pred_labels] +``` + +## 将文本转换为训练前数据集 + +现在我们已经准备好自定义 `Dataset` 课程,用于预训 BERT。在此之前,我们仍然需要定义一个助手函数 `_pad_bert_inputs` 来将特殊的 “” 令牌附加到输入中。它的论点 `examples` 包含了帮助函数 `_get_nsp_data_from_paragraph` 和 `_get_mlm_data_from_tokens` 用于两个预训任务的输出。 + +```{.python .input} +#@save +def _pad_bert_inputs(examples, max_len, vocab): + max_num_mlm_preds = round(max_len * 0.15) + all_token_ids, all_segments, valid_lens, = [], [], [] + all_pred_positions, all_mlm_weights, all_mlm_labels = [], [], [] + nsp_labels = [] + for (token_ids, pred_positions, mlm_pred_label_ids, segments, + is_next) in examples: + all_token_ids.append(np.array(token_ids + [vocab['']] * ( + max_len - len(token_ids)), dtype='int32')) + all_segments.append(np.array(segments + [0] * ( + max_len - len(segments)), dtype='int32')) + # `valid_lens` excludes count of '' tokens + valid_lens.append(np.array(len(token_ids), dtype='float32')) + all_pred_positions.append(np.array(pred_positions + [0] * ( + max_num_mlm_preds - len(pred_positions)), dtype='int32')) + # Predictions of padded tokens will be filtered out in the loss via + # multiplication of 0 weights + all_mlm_weights.append( + np.array([1.0] * len(mlm_pred_label_ids) + [0.0] * ( + max_num_mlm_preds - len(pred_positions)), dtype='float32')) + all_mlm_labels.append(np.array(mlm_pred_label_ids + [0] * ( + max_num_mlm_preds - len(mlm_pred_label_ids)), dtype='int32')) + nsp_labels.append(np.array(is_next)) + return (all_token_ids, all_segments, valid_lens, all_pred_positions, + all_mlm_weights, all_mlm_labels, nsp_labels) +``` + +```{.python .input} +#@tab pytorch +#@save +def 
_pad_bert_inputs(examples, max_len, vocab): + max_num_mlm_preds = round(max_len * 0.15) + all_token_ids, all_segments, valid_lens, = [], [], [] + all_pred_positions, all_mlm_weights, all_mlm_labels = [], [], [] + nsp_labels = [] + for (token_ids, pred_positions, mlm_pred_label_ids, segments, + is_next) in examples: + all_token_ids.append(torch.tensor(token_ids + [vocab['']] * ( + max_len - len(token_ids)), dtype=torch.long)) + all_segments.append(torch.tensor(segments + [0] * ( + max_len - len(segments)), dtype=torch.long)) + # `valid_lens` excludes count of '' tokens + valid_lens.append(torch.tensor(len(token_ids), dtype=torch.float32)) + all_pred_positions.append(torch.tensor(pred_positions + [0] * ( + max_num_mlm_preds - len(pred_positions)), dtype=torch.long)) + # Predictions of padded tokens will be filtered out in the loss via + # multiplication of 0 weights + all_mlm_weights.append( + torch.tensor([1.0] * len(mlm_pred_label_ids) + [0.0] * ( + max_num_mlm_preds - len(pred_positions)), + dtype=torch.float32)) + all_mlm_labels.append(torch.tensor(mlm_pred_label_ids + [0] * ( + max_num_mlm_preds - len(mlm_pred_label_ids)), dtype=torch.long)) + nsp_labels.append(torch.tensor(is_next, dtype=torch.long)) + return (all_token_ids, all_segments, valid_lens, all_pred_positions, + all_mlm_weights, all_mlm_labels, nsp_labels) +``` + +将用于生成两个预训练任务的训练示例的帮助函数和用于填充输入的辅助函数放在一起,我们将以下 `_WikiTextDataset` 类定制为用于预训练 BERT 的 WikiText-2 数据集。通过实现 `__getitem__ ` 函数,我们可以任意访问来自 WikiText-2 语料库的一对句子生成的预训练(蒙面语言建模和下一句预测)示例。 + +原来的 BERT 模型使用字体嵌入,其词汇量为 30000 :cite:`Wu.Schuster.Chen.ea.2016`。WordPiece 的标记化方法是对 :numref:`subsec_Byte_Pair_Encoding` 中原来的字节对编码算法的轻微修改。为简单起见,我们使用 `d2l.tokenize` 函数进行标记化。出现少于五次的罕见代币将被过滤掉。 + +```{.python .input} +#@save +class _WikiTextDataset(gluon.data.Dataset): + def __init__(self, paragraphs, max_len): + # Input `paragraphs[i]` is a list of sentence strings representing a + # paragraph; while output `paragraphs[i]` is a list of sentences + # representing a paragraph, where each sentence is a list of tokens + paragraphs = [d2l.tokenize( + paragraph, token='word') for paragraph in paragraphs] + sentences = [sentence for paragraph in paragraphs + for sentence in paragraph] + self.vocab = d2l.Vocab(sentences, min_freq=5, reserved_tokens=[ + '', '', '', '']) + # Get data for the next sentence prediction task + examples = [] + for paragraph in paragraphs: + examples.extend(_get_nsp_data_from_paragraph( + paragraph, paragraphs, self.vocab, max_len)) + # Get data for the masked language model task + examples = [(_get_mlm_data_from_tokens(tokens, self.vocab) + + (segments, is_next)) + for tokens, segments, is_next in examples] + # Pad inputs + (self.all_token_ids, self.all_segments, self.valid_lens, + self.all_pred_positions, self.all_mlm_weights, + self.all_mlm_labels, self.nsp_labels) = _pad_bert_inputs( + examples, max_len, self.vocab) + + def __getitem__(self, idx): + return (self.all_token_ids[idx], self.all_segments[idx], + self.valid_lens[idx], self.all_pred_positions[idx], + self.all_mlm_weights[idx], self.all_mlm_labels[idx], + self.nsp_labels[idx]) + + def __len__(self): + return len(self.all_token_ids) +``` + +```{.python .input} +#@tab pytorch +#@save +class _WikiTextDataset(torch.utils.data.Dataset): + def __init__(self, paragraphs, max_len): + # Input `paragraphs[i]` is a list of sentence strings representing a + # paragraph; while output `paragraphs[i]` is a list of sentences + # representing a paragraph, where each sentence is a list of tokens + paragraphs = [d2l.tokenize( + 
paragraph, token='word') for paragraph in paragraphs] + sentences = [sentence for paragraph in paragraphs + for sentence in paragraph] + self.vocab = d2l.Vocab(sentences, min_freq=5, reserved_tokens=[ + '', '', '', '']) + # Get data for the next sentence prediction task + examples = [] + for paragraph in paragraphs: + examples.extend(_get_nsp_data_from_paragraph( + paragraph, paragraphs, self.vocab, max_len)) + # Get data for the masked language model task + examples = [(_get_mlm_data_from_tokens(tokens, self.vocab) + + (segments, is_next)) + for tokens, segments, is_next in examples] + # Pad inputs + (self.all_token_ids, self.all_segments, self.valid_lens, + self.all_pred_positions, self.all_mlm_weights, + self.all_mlm_labels, self.nsp_labels) = _pad_bert_inputs( + examples, max_len, self.vocab) + + def __getitem__(self, idx): + return (self.all_token_ids[idx], self.all_segments[idx], + self.valid_lens[idx], self.all_pred_positions[idx], + self.all_mlm_weights[idx], self.all_mlm_labels[idx], + self.nsp_labels[idx]) + + def __len__(self): + return len(self.all_token_ids) +``` + +通过使用 `_read_wiki` 函数和 `_WikiTextDataset` 类,我们定义了下载以下 `load_data_wiki` 和 WikiText-2 数据集并从中生成预训示例。 + +```{.python .input} +#@save +def load_data_wiki(batch_size, max_len): + """Load the WikiText-2 dataset.""" + num_workers = d2l.get_dataloader_workers() + data_dir = d2l.download_extract('wikitext-2', 'wikitext-2') + paragraphs = _read_wiki(data_dir) + train_set = _WikiTextDataset(paragraphs, max_len) + train_iter = gluon.data.DataLoader(train_set, batch_size, shuffle=True, + num_workers=num_workers) + return train_iter, train_set.vocab +``` + +```{.python .input} +#@tab pytorch +#@save +def load_data_wiki(batch_size, max_len): + """Load the WikiText-2 dataset.""" + num_workers = d2l.get_dataloader_workers() + data_dir = d2l.download_extract('wikitext-2', 'wikitext-2') + paragraphs = _read_wiki(data_dir) + train_set = _WikiTextDataset(paragraphs, max_len) + train_iter = torch.utils.data.DataLoader(train_set, batch_size, + shuffle=True, num_workers=num_workers) + return train_iter, train_set.vocab +``` + +将批次大小设置为 512,并且 BERT 输入序列的最大长度为 64,我们会打印出一个小批量的 BERT 预训练示例的形状。请注意,在每个 BERT 输入序列中,为蒙版语言建模任务预测 $10$ ($64 \times 0.15$) 的位置。 + +```{.python .input} +#@tab all +batch_size, max_len = 512, 64 +train_iter, vocab = load_data_wiki(batch_size, max_len) + +for (tokens_X, segments_X, valid_lens_x, pred_positions_X, mlm_weights_X, + mlm_Y, nsp_y) in train_iter: + print(tokens_X.shape, segments_X.shape, valid_lens_x.shape, + pred_positions_X.shape, mlm_weights_X.shape, mlm_Y.shape, + nsp_y.shape) + break +``` + +最后,让我们来看看词汇量的大小。即使在过滤掉不常见的令牌之后,它仍比 PTB 数据集大两倍以上。 + +```{.python .input} +#@tab all +len(vocab) +``` + +## 摘要 + +* 与 PTB 数据集相比,Wikitext-2 日期集保留了原始标点符号、大小写和数字,并且大两倍以上。 +* 我们可以任意访问来自 WikiText-2 语料库的一对句子生成的预训练(蒙面语言建模和下一句预测)示例。 + +## 练习 + +1. 为简单起见,句点被用作分割句子的唯一分隔符。尝试其他句子拆分技术,例如 SPacy 和 NLTK。以 NLTK 为例。你需要先安装 NLTK:`pip install nltk`。在代码中,首先是 `import nltk`。然后,下载 Punkt 句子分词器:`nltk.download('punkt')`。分割诸如 `句子 = '这太棒了!为什么不呢?'`, invoking `nltk.tokenize.sent_tokenize(句子)` will return a list of two sentence strings: ` [“这太棒了!”,“为什么不?”]`。 +1. 如果我们不过滤掉任何不常见的令牌,词汇量是多少? 
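As a starting point for the first exercise above, here is a minimal sketch of sentence splitting with NLTK; it assumes the `nltk` package and the Punkt model mentioned in the exercise have been installed (spaCy would be used analogously).

```python
# A minimal sketch for the sentence-splitting exercise (requires
# `pip install nltk` beforehand)
import nltk

nltk.download('punkt')  # download the Punkt sentence tokenizer
sentences = 'This is great ! Why not ?'
# Expected result: ['This is great !', 'Why not ?']
print(nltk.tokenize.sent_tokenize(sentences))
```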
+ +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/389) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1496) +:end_tab: diff --git a/chapter_natural-language-processing-pretraining/bert-dataset_origin.md b/chapter_natural-language-processing-pretraining/bert-dataset_origin.md new file mode 100644 index 000000000..1b1c1115e --- /dev/null +++ b/chapter_natural-language-processing-pretraining/bert-dataset_origin.md @@ -0,0 +1,428 @@ +# The Dataset for Pretraining BERT +:label:`sec_bert-dataset` + +To pretrain the BERT model as implemented in :numref:`sec_bert`, +we need to generate the dataset in the ideal format to facilitate +the two pretraining tasks: +masked language modeling and next sentence prediction. +On one hand, +the original BERT model is pretrained on the concatenation of +two huge corpora BookCorpus and English Wikipedia (see :numref:`subsec_bert_pretraining_tasks`), +making it hard to run for most readers of this book. +On the other hand, +the off-the-shelf pretrained BERT model +may not fit for applications from specific domains like medicine. +Thus, it is getting popular to pretrain BERT on a customized dataset. +To facilitate the demonstration of BERT pretraining, +we use a smaller corpus WikiText-2 :cite:`Merity.Xiong.Bradbury.ea.2016`. + +Comparing with the PTB dataset used for pretraining word2vec in :numref:`sec_word2vec_data`, +WikiText-2 (i) retains the original punctuation, making it suitable for next sentence prediction; (ii) retains the original case and numbers; (iii) is over twice larger. + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import gluon, np, npx +import os +import random + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import os +import random +import torch +``` + +In the WikiText-2 dataset, +each line represents a paragraph where +space is inserted between any punctuation and its preceding token. +Paragraphs with at least two sentences are retained. +To split sentences, we only use the period as the delimiter for simplicity. +We leave discussions of more complex sentence splitting techniques in the exercises +at the end of this section. + +```{.python .input} +#@tab all +#@save +d2l.DATA_HUB['wikitext-2'] = ( + 'https://s3.amazonaws.com/research.metamind.io/wikitext/' + 'wikitext-2-v1.zip', '3c914d17d80b1459be871a5039ac23e752a53cbe') + +#@save +def _read_wiki(data_dir): + file_name = os.path.join(data_dir, 'wiki.train.tokens') + with open(file_name, 'r') as f: + lines = f.readlines() + # Uppercase letters are converted to lowercase ones + paragraphs = [line.strip().lower().split(' . ') + for line in lines if len(line.split(' . ')) >= 2] + random.shuffle(paragraphs) + return paragraphs +``` + +## Defining Helper Functions for Pretraining Tasks + +In the following, +we begin by implementing helper functions for the two BERT pretraining tasks: +next sentence prediction and masked language modeling. +These helper functions will be invoked later +when transforming the raw text corpus +into the dataset of the ideal format to pretrain BERT. + +### Generating the Next Sentence Prediction Task + +According to descriptions of :numref:`subsec_nsp`, +the `_get_next_sentence` function generates a training example +for the binary classification task. 
+ +```{.python .input} +#@tab all +#@save +def _get_next_sentence(sentence, next_sentence, paragraphs): + if random.random() < 0.5: + is_next = True + else: + # `paragraphs` is a list of lists of lists + next_sentence = random.choice(random.choice(paragraphs)) + is_next = False + return sentence, next_sentence, is_next +``` + +The following function generates training examples for next sentence prediction +from the input `paragraph` by invoking the `_get_next_sentence` function. +Here `paragraph` is a list of sentences, where each sentence is a list of tokens. +The argument `max_len` specifies the maximum length of a BERT input sequence during pretraining. + +```{.python .input} +#@tab all +#@save +def _get_nsp_data_from_paragraph(paragraph, paragraphs, vocab, max_len): + nsp_data_from_paragraph = [] + for i in range(len(paragraph) - 1): + tokens_a, tokens_b, is_next = _get_next_sentence( + paragraph[i], paragraph[i + 1], paragraphs) + # Consider 1 '' token and 2 '' tokens + if len(tokens_a) + len(tokens_b) + 3 > max_len: + continue + tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b) + nsp_data_from_paragraph.append((tokens, segments, is_next)) + return nsp_data_from_paragraph +``` + +### Generating the Masked Language Modeling Task +:label:`subsec_prepare_mlm_data` + +In order to generate training examples +for the masked language modeling task +from a BERT input sequence, +we define the following `_replace_mlm_tokens` function. +In its inputs, `tokens` is a list of tokens representing a BERT input sequence, +`candidate_pred_positions` is a list of token indices of the BERT input sequence +excluding those of special tokens (special tokens are not predicted in the masked language modeling task), +and `num_mlm_preds` indicates the number of predictions (recall 15% random tokens to predict). +Following the definition of the masked language modeling task in :numref:`subsec_mlm`, +at each prediction position, the input may be replaced by +a special “<mask>” token or a random token, or remain unchanged. +In the end, the function returns the input tokens after possible replacement, +the token indices where predictions take place and labels for these predictions. 
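As a concrete instance of the 15% rule (the numbers here are only an illustration; the exact formula appears in `_get_mlm_data_from_tokens` below), a BERT input sequence of 64 tokens yields

$$\max(1, \operatorname{round}(64 \times 0.15)) = 10$$

prediction positions, which matches the minibatch shapes printed at the end of this section.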
+ +```{.python .input} +#@tab all +#@save +def _replace_mlm_tokens(tokens, candidate_pred_positions, num_mlm_preds, + vocab): + # Make a new copy of tokens for the input of a masked language model, + # where the input may contain replaced '' or random tokens + mlm_input_tokens = [token for token in tokens] + pred_positions_and_labels = [] + # Shuffle for getting 15% random tokens for prediction in the masked + # language modeling task + random.shuffle(candidate_pred_positions) + for mlm_pred_position in candidate_pred_positions: + if len(pred_positions_and_labels) >= num_mlm_preds: + break + masked_token = None + # 80% of the time: replace the word with the '' token + if random.random() < 0.8: + masked_token = '' + else: + # 10% of the time: keep the word unchanged + if random.random() < 0.5: + masked_token = tokens[mlm_pred_position] + # 10% of the time: replace the word with a random word + else: + masked_token = random.randint(0, len(vocab) - 1) + mlm_input_tokens[mlm_pred_position] = masked_token + pred_positions_and_labels.append( + (mlm_pred_position, tokens[mlm_pred_position])) + return mlm_input_tokens, pred_positions_and_labels +``` + +By invoking the aforementioned `_replace_mlm_tokens` function, +the following function takes a BERT input sequence (`tokens`) +as an input and returns indices of the input tokens +(after possible token replacement as described in :numref:`subsec_mlm`), +the token indices where predictions take place, +and label indices for these predictions. + +```{.python .input} +#@tab all +#@save +def _get_mlm_data_from_tokens(tokens, vocab): + candidate_pred_positions = [] + # `tokens` is a list of strings + for i, token in enumerate(tokens): + # Special tokens are not predicted in the masked language modeling + # task + if token in ['', '']: + continue + candidate_pred_positions.append(i) + # 15% of random tokens are predicted in the masked language modeling task + num_mlm_preds = max(1, round(len(tokens) * 0.15)) + mlm_input_tokens, pred_positions_and_labels = _replace_mlm_tokens( + tokens, candidate_pred_positions, num_mlm_preds, vocab) + pred_positions_and_labels = sorted(pred_positions_and_labels, + key=lambda x: x[0]) + pred_positions = [v[0] for v in pred_positions_and_labels] + mlm_pred_labels = [v[1] for v in pred_positions_and_labels] + return vocab[mlm_input_tokens], pred_positions, vocab[mlm_pred_labels] +``` + +## Transforming Text into the Pretraining Dataset + +Now we are almost ready to customize a `Dataset` class for pretraining BERT. +Before that, +we still need to define a helper function `_pad_bert_inputs` +to append the special “<mask>” tokens to the inputs. +Its argument `examples` contain the outputs from the helper functions `_get_nsp_data_from_paragraph` and `_get_mlm_data_from_tokens` for the two pretraining tasks. 
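Concretely, each element of `examples` at this stage is a 5-tuple that the loop below unpacks. The following sketch shows its layout; only the field order is taken from the code, and the vocabulary indices are made up for illustration.

```python
# Illustrative layout of one element of `examples` (hypothetical indices):
# (token_ids, pred_positions, mlm_pred_label_ids, segments, is_next)
example = ([1, 57, 102, 2, 94, 88, 2],   # ids of '<cls>' ... '<sep>' ... '<sep>'
           [2],                          # position chosen for MLM prediction
           [103],                        # id of the original token at that position
           [0, 0, 0, 0, 1, 1, 1],        # segment ids of the two text sequences
           True)                         # next sentence prediction label
```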
+ +```{.python .input} +#@save +def _pad_bert_inputs(examples, max_len, vocab): + max_num_mlm_preds = round(max_len * 0.15) + all_token_ids, all_segments, valid_lens, = [], [], [] + all_pred_positions, all_mlm_weights, all_mlm_labels = [], [], [] + nsp_labels = [] + for (token_ids, pred_positions, mlm_pred_label_ids, segments, + is_next) in examples: + all_token_ids.append(np.array(token_ids + [vocab['']] * ( + max_len - len(token_ids)), dtype='int32')) + all_segments.append(np.array(segments + [0] * ( + max_len - len(segments)), dtype='int32')) + # `valid_lens` excludes count of '' tokens + valid_lens.append(np.array(len(token_ids), dtype='float32')) + all_pred_positions.append(np.array(pred_positions + [0] * ( + max_num_mlm_preds - len(pred_positions)), dtype='int32')) + # Predictions of padded tokens will be filtered out in the loss via + # multiplication of 0 weights + all_mlm_weights.append( + np.array([1.0] * len(mlm_pred_label_ids) + [0.0] * ( + max_num_mlm_preds - len(pred_positions)), dtype='float32')) + all_mlm_labels.append(np.array(mlm_pred_label_ids + [0] * ( + max_num_mlm_preds - len(mlm_pred_label_ids)), dtype='int32')) + nsp_labels.append(np.array(is_next)) + return (all_token_ids, all_segments, valid_lens, all_pred_positions, + all_mlm_weights, all_mlm_labels, nsp_labels) +``` + +```{.python .input} +#@tab pytorch +#@save +def _pad_bert_inputs(examples, max_len, vocab): + max_num_mlm_preds = round(max_len * 0.15) + all_token_ids, all_segments, valid_lens, = [], [], [] + all_pred_positions, all_mlm_weights, all_mlm_labels = [], [], [] + nsp_labels = [] + for (token_ids, pred_positions, mlm_pred_label_ids, segments, + is_next) in examples: + all_token_ids.append(torch.tensor(token_ids + [vocab['']] * ( + max_len - len(token_ids)), dtype=torch.long)) + all_segments.append(torch.tensor(segments + [0] * ( + max_len - len(segments)), dtype=torch.long)) + # `valid_lens` excludes count of '' tokens + valid_lens.append(torch.tensor(len(token_ids), dtype=torch.float32)) + all_pred_positions.append(torch.tensor(pred_positions + [0] * ( + max_num_mlm_preds - len(pred_positions)), dtype=torch.long)) + # Predictions of padded tokens will be filtered out in the loss via + # multiplication of 0 weights + all_mlm_weights.append( + torch.tensor([1.0] * len(mlm_pred_label_ids) + [0.0] * ( + max_num_mlm_preds - len(pred_positions)), + dtype=torch.float32)) + all_mlm_labels.append(torch.tensor(mlm_pred_label_ids + [0] * ( + max_num_mlm_preds - len(mlm_pred_label_ids)), dtype=torch.long)) + nsp_labels.append(torch.tensor(is_next, dtype=torch.long)) + return (all_token_ids, all_segments, valid_lens, all_pred_positions, + all_mlm_weights, all_mlm_labels, nsp_labels) +``` + +Putting the helper functions for +generating training examples of the two pretraining tasks, +and the helper function for padding inputs together, +we customize the following `_WikiTextDataset` class as the WikiText-2 dataset for pretraining BERT. +By implementing the `__getitem__ `function, +we can arbitrarily access the pretraining (masked language modeling and next sentence prediction) examples +generated from a pair of sentences from the WikiText-2 corpus. + +The original BERT model uses WordPiece embeddings whose vocabulary size is 30000 :cite:`Wu.Schuster.Chen.ea.2016`. +The tokenization method of WordPiece is a slight modification of +the original byte pair encoding algorithm in :numref:`subsec_Byte_Pair_Encoding`. +For simplicity, we use the `d2l.tokenize` function for tokenization. 
+Infrequent tokens that appear less than five times are filtered out. + +```{.python .input} +#@save +class _WikiTextDataset(gluon.data.Dataset): + def __init__(self, paragraphs, max_len): + # Input `paragraphs[i]` is a list of sentence strings representing a + # paragraph; while output `paragraphs[i]` is a list of sentences + # representing a paragraph, where each sentence is a list of tokens + paragraphs = [d2l.tokenize( + paragraph, token='word') for paragraph in paragraphs] + sentences = [sentence for paragraph in paragraphs + for sentence in paragraph] + self.vocab = d2l.Vocab(sentences, min_freq=5, reserved_tokens=[ + '', '', '', '']) + # Get data for the next sentence prediction task + examples = [] + for paragraph in paragraphs: + examples.extend(_get_nsp_data_from_paragraph( + paragraph, paragraphs, self.vocab, max_len)) + # Get data for the masked language model task + examples = [(_get_mlm_data_from_tokens(tokens, self.vocab) + + (segments, is_next)) + for tokens, segments, is_next in examples] + # Pad inputs + (self.all_token_ids, self.all_segments, self.valid_lens, + self.all_pred_positions, self.all_mlm_weights, + self.all_mlm_labels, self.nsp_labels) = _pad_bert_inputs( + examples, max_len, self.vocab) + + def __getitem__(self, idx): + return (self.all_token_ids[idx], self.all_segments[idx], + self.valid_lens[idx], self.all_pred_positions[idx], + self.all_mlm_weights[idx], self.all_mlm_labels[idx], + self.nsp_labels[idx]) + + def __len__(self): + return len(self.all_token_ids) +``` + +```{.python .input} +#@tab pytorch +#@save +class _WikiTextDataset(torch.utils.data.Dataset): + def __init__(self, paragraphs, max_len): + # Input `paragraphs[i]` is a list of sentence strings representing a + # paragraph; while output `paragraphs[i]` is a list of sentences + # representing a paragraph, where each sentence is a list of tokens + paragraphs = [d2l.tokenize( + paragraph, token='word') for paragraph in paragraphs] + sentences = [sentence for paragraph in paragraphs + for sentence in paragraph] + self.vocab = d2l.Vocab(sentences, min_freq=5, reserved_tokens=[ + '', '', '', '']) + # Get data for the next sentence prediction task + examples = [] + for paragraph in paragraphs: + examples.extend(_get_nsp_data_from_paragraph( + paragraph, paragraphs, self.vocab, max_len)) + # Get data for the masked language model task + examples = [(_get_mlm_data_from_tokens(tokens, self.vocab) + + (segments, is_next)) + for tokens, segments, is_next in examples] + # Pad inputs + (self.all_token_ids, self.all_segments, self.valid_lens, + self.all_pred_positions, self.all_mlm_weights, + self.all_mlm_labels, self.nsp_labels) = _pad_bert_inputs( + examples, max_len, self.vocab) + + def __getitem__(self, idx): + return (self.all_token_ids[idx], self.all_segments[idx], + self.valid_lens[idx], self.all_pred_positions[idx], + self.all_mlm_weights[idx], self.all_mlm_labels[idx], + self.nsp_labels[idx]) + + def __len__(self): + return len(self.all_token_ids) +``` + +By using the `_read_wiki` function and the `_WikiTextDataset` class, +we define the following `load_data_wiki` to download and WikiText-2 dataset +and generate pretraining examples from it. 
+ +```{.python .input} +#@save +def load_data_wiki(batch_size, max_len): + """Load the WikiText-2 dataset.""" + num_workers = d2l.get_dataloader_workers() + data_dir = d2l.download_extract('wikitext-2', 'wikitext-2') + paragraphs = _read_wiki(data_dir) + train_set = _WikiTextDataset(paragraphs, max_len) + train_iter = gluon.data.DataLoader(train_set, batch_size, shuffle=True, + num_workers=num_workers) + return train_iter, train_set.vocab +``` + +```{.python .input} +#@tab pytorch +#@save +def load_data_wiki(batch_size, max_len): + """Load the WikiText-2 dataset.""" + num_workers = d2l.get_dataloader_workers() + data_dir = d2l.download_extract('wikitext-2', 'wikitext-2') + paragraphs = _read_wiki(data_dir) + train_set = _WikiTextDataset(paragraphs, max_len) + train_iter = torch.utils.data.DataLoader(train_set, batch_size, + shuffle=True, num_workers=num_workers) + return train_iter, train_set.vocab +``` + +Setting the batch size to 512 and the maximum length of a BERT input sequence to be 64, +we print out the shapes of a minibatch of BERT pretraining examples. +Note that in each BERT input sequence, +$10$ ($64 \times 0.15$) positions are predicted for the masked language modeling task. + +```{.python .input} +#@tab all +batch_size, max_len = 512, 64 +train_iter, vocab = load_data_wiki(batch_size, max_len) + +for (tokens_X, segments_X, valid_lens_x, pred_positions_X, mlm_weights_X, + mlm_Y, nsp_y) in train_iter: + print(tokens_X.shape, segments_X.shape, valid_lens_x.shape, + pred_positions_X.shape, mlm_weights_X.shape, mlm_Y.shape, + nsp_y.shape) + break +``` + +In the end, let us take a look at the vocabulary size. +Even after filtering out infrequent tokens, +it is still over twice larger than that of the PTB dataset. + +```{.python .input} +#@tab all +len(vocab) +``` + +## Summary + +* Comparing with the PTB dataset, the WikiText-2 dateset retains the original punctuation, case and numbers, and is over twice larger. +* We can arbitrarily access the pretraining (masked language modeling and next sentence prediction) examples generated from a pair of sentences from the WikiText-2 corpus. + + +## Exercises + +1. For simplicity, the period is used as the only delimiter for splitting sentences. Try other sentence splitting techniques, such as the spaCy and NLTK. Take NLTK as an example. You need to install NLTK first: `pip install nltk`. In the code, first `import nltk`. Then, download the Punkt sentence tokenizer: `nltk.download('punkt')`. To split sentences such as `sentences = 'This is great ! Why not ?'`, invoking `nltk.tokenize.sent_tokenize(sentences)` will return a list of two sentence strings: `['This is great !', 'Why not ?']`. +1. What is the vocabulary size if we do not filter out any infrequent token? 
+ +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/389) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1496) +:end_tab: diff --git a/chapter_natural-language-processing-pretraining/bert-pretraining.md b/chapter_natural-language-processing-pretraining/bert-pretraining.md new file mode 100644 index 000000000..9baf732a3 --- /dev/null +++ b/chapter_natural-language-processing-pretraining/bert-pretraining.md @@ -0,0 +1,271 @@ +# 培训前培训 BERT +:label:`sec_bert-pretraining` + +随着在 :numref:`sec_bert` 中实施了 BERT 模型,以及 :numref:`sec_bert-dataset` 中从 WikiText-2 数据集生成的预训练示例,我们将在本节的 WikiText-2 数据集上预训练 BERT。 + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import autograd, gluon, init, np, npx + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +``` + +首先,我们加载 Wikitext-2 数据集作为用于掩码语言建模和下一句话预测的预训练示例的小组。批次大小为 512,BERT 输入序列的最大长度为 64。请注意,在原始的 BERT 模型中,最大长度为 512。 + +```{.python .input} +#@tab all +batch_size, max_len = 512, 64 +train_iter, vocab = d2l.load_data_wiki(batch_size, max_len) +``` + +## 培训前培训 BERT + +原来的 BERT 有两个不同型号尺寸 :cite:`Devlin.Chang.Lee.ea.2018` 的版本。基本型号($\text{BERT}_{\text{BASE}}$)使用 12 层(变压器编码器块),其中包含 768 个隐藏单元(隐藏尺寸)和 12 个自我注意头。大型模型($\text{BERT}_{\text{LARGE}}$)使用 24 层,其中有 1024 个隐藏单元和 16 个自我注意头。值得注意的是,前者有 1.1 亿个参数,而后者有 3.4 亿个参数。为了轻松进行演示,我们定义了一个小型 BERT,它使用 2 层、128 个隐藏单位和 2 个自我注意头。 + +```{.python .input} +net = d2l.BERTModel(len(vocab), num_hiddens=128, ffn_num_hiddens=256, + num_heads=2, num_layers=2, dropout=0.2) +devices = d2l.try_all_gpus() +net.initialize(init.Xavier(), ctx=devices) +loss = gluon.loss.SoftmaxCELoss() +``` + +```{.python .input} +#@tab pytorch +net = d2l.BERTModel(len(vocab), num_hiddens=128, norm_shape=[128], + ffn_num_input=128, ffn_num_hiddens=256, num_heads=2, + num_layers=2, dropout=0.2, key_size=128, query_size=128, + value_size=128, hid_in_features=128, mlm_in_features=128, + nsp_in_features=128) +devices = d2l.try_all_gpus() +loss = nn.CrossEntropyLoss() +``` + +在定义训练循环之前,我们定义了一个助手函数 `_get_batch_loss_bert`。鉴于训练示例的数量,此函数计算蒙版语言建模和下一句预测任务的损失。请注意,BERT 预训练的最后损失只是蒙版语言建模损失和下一句预测损失的总和。 + +```{.python .input} +#@save +def _get_batch_loss_bert(net, loss, vocab_size, tokens_X_shards, + segments_X_shards, valid_lens_x_shards, + pred_positions_X_shards, mlm_weights_X_shards, + mlm_Y_shards, nsp_y_shards): + mlm_ls, nsp_ls, ls = [], [], [] + for (tokens_X_shard, segments_X_shard, valid_lens_x_shard, + pred_positions_X_shard, mlm_weights_X_shard, mlm_Y_shard, + nsp_y_shard) in zip( + tokens_X_shards, segments_X_shards, valid_lens_x_shards, + pred_positions_X_shards, mlm_weights_X_shards, mlm_Y_shards, + nsp_y_shards): + # Forward pass + _, mlm_Y_hat, nsp_Y_hat = net( + tokens_X_shard, segments_X_shard, valid_lens_x_shard.reshape(-1), + pred_positions_X_shard) + # Compute masked language model loss + mlm_l = loss( + mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y_shard.reshape(-1), + mlm_weights_X_shard.reshape((-1, 1))) + mlm_l = mlm_l.sum() / (mlm_weights_X_shard.sum() + 1e-8) + # Compute next sentence prediction loss + nsp_l = loss(nsp_Y_hat, nsp_y_shard) + nsp_l = nsp_l.mean() + mlm_ls.append(mlm_l) + nsp_ls.append(nsp_l) + ls.append(mlm_l + nsp_l) + npx.waitall() + return mlm_ls, nsp_ls, ls +``` + +```{.python .input} +#@tab pytorch +#@save +def _get_batch_loss_bert(net, loss, vocab_size, tokens_X, + segments_X, valid_lens_x, + pred_positions_X, mlm_weights_X, + mlm_Y, nsp_y): + # Forward pass + _, mlm_Y_hat, nsp_Y_hat = net(tokens_X, segments_X, + 
valid_lens_x.reshape(-1), + pred_positions_X) + # Compute masked language model loss + mlm_l = loss(mlm_Y_hat.reshape(-1, vocab_size), mlm_Y.reshape(-1)) *\ + mlm_weights_X.reshape(-1, 1) + mlm_l = mlm_l.sum() / (mlm_weights_X.sum() + 1e-8) + # Compute next sentence prediction loss + nsp_l = loss(nsp_Y_hat, nsp_y) + l = mlm_l + nsp_l + return mlm_l, nsp_l, l +``` + +调用上述两个辅助函数,以下 `train_bert` 函数定义了在 Wikitext-2 (`train_iter`) 数据集上预训练 BERT (`net`) 的过程。培训 BERT 可能需要很长时间。以下函数的输入 `num_steps` 没有像 `train_ch13` 函数那样指定训练的时代数量(参见 :numref:`sec_image_augmentation`),而是指定训练的迭代步数。 + +```{.python .input} +def train_bert(train_iter, net, loss, vocab_size, devices, num_steps): + trainer = gluon.Trainer(net.collect_params(), 'adam', + {'learning_rate': 1e-3}) + step, timer = 0, d2l.Timer() + animator = d2l.Animator(xlabel='step', ylabel='loss', + xlim=[1, num_steps], legend=['mlm', 'nsp']) + # Sum of masked language modeling losses, sum of next sentence prediction + # losses, no. of sentence pairs, count + metric = d2l.Accumulator(4) + num_steps_reached = False + while step < num_steps and not num_steps_reached: + for batch in train_iter: + (tokens_X_shards, segments_X_shards, valid_lens_x_shards, + pred_positions_X_shards, mlm_weights_X_shards, + mlm_Y_shards, nsp_y_shards) = [gluon.utils.split_and_load( + elem, devices, even_split=False) for elem in batch] + timer.start() + with autograd.record(): + mlm_ls, nsp_ls, ls = _get_batch_loss_bert( + net, loss, vocab_size, tokens_X_shards, segments_X_shards, + valid_lens_x_shards, pred_positions_X_shards, + mlm_weights_X_shards, mlm_Y_shards, nsp_y_shards) + for l in ls: + l.backward() + trainer.step(1) + mlm_l_mean = sum([float(l) for l in mlm_ls]) / len(mlm_ls) + nsp_l_mean = sum([float(l) for l in nsp_ls]) / len(nsp_ls) + metric.add(mlm_l_mean, nsp_l_mean, batch[0].shape[0], 1) + timer.stop() + animator.add(step + 1, + (metric[0] / metric[3], metric[1] / metric[3])) + step += 1 + if step == num_steps: + num_steps_reached = True + break + + print(f'MLM loss {metric[0] / metric[3]:.3f}, ' + f'NSP loss {metric[1] / metric[3]:.3f}') + print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on ' + f'{str(devices)}') +``` + +```{.python .input} +#@tab pytorch +def train_bert(train_iter, net, loss, vocab_size, devices, num_steps): + net = nn.DataParallel(net, device_ids=devices).to(devices[0]) + trainer = torch.optim.Adam(net.parameters(), lr=1e-3) + step, timer = 0, d2l.Timer() + animator = d2l.Animator(xlabel='step', ylabel='loss', + xlim=[1, num_steps], legend=['mlm', 'nsp']) + # Sum of masked language modeling losses, sum of next sentence prediction + # losses, no. 
of sentence pairs, count + metric = d2l.Accumulator(4) + num_steps_reached = False + while step < num_steps and not num_steps_reached: + for tokens_X, segments_X, valid_lens_x, pred_positions_X,\ + mlm_weights_X, mlm_Y, nsp_y in train_iter: + tokens_X = tokens_X.to(devices[0]) + segments_X = segments_X.to(devices[0]) + valid_lens_x = valid_lens_x.to(devices[0]) + pred_positions_X = pred_positions_X.to(devices[0]) + mlm_weights_X = mlm_weights_X.to(devices[0]) + mlm_Y, nsp_y = mlm_Y.to(devices[0]), nsp_y.to(devices[0]) + trainer.zero_grad() + timer.start() + mlm_l, nsp_l, l = _get_batch_loss_bert( + net, loss, vocab_size, tokens_X, segments_X, valid_lens_x, + pred_positions_X, mlm_weights_X, mlm_Y, nsp_y) + l.backward() + trainer.step() + metric.add(mlm_l, nsp_l, tokens_X.shape[0], 1) + timer.stop() + animator.add(step + 1, + (metric[0] / metric[3], metric[1] / metric[3])) + step += 1 + if step == num_steps: + num_steps_reached = True + break + + print(f'MLM loss {metric[0] / metric[3]:.3f}, ' + f'NSP loss {metric[1] / metric[3]:.3f}') + print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on ' + f'{str(devices)}') +``` + +我们可以绘制 BERT 预训期间的蒙版语言建模损失和下一句话预测损失。 + +```{.python .input} +#@tab all +train_bert(train_iter, net, loss, len(vocab), devices, 50) +``` + +## 用 BERT 表示文本 + +在预训练 BERT 之后,我们可以用它来表示单个文本、文本对或其中的任何标记。以下函数返回 `tokens_a` 和 `tokens_b` 中所有令牌的 BERT (`net`) 表示形式。 + +```{.python .input} +def get_bert_encoding(net, tokens_a, tokens_b=None): + tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b) + token_ids = np.expand_dims(np.array(vocab[tokens], ctx=devices[0]), + axis=0) + segments = np.expand_dims(np.array(segments, ctx=devices[0]), axis=0) + valid_len = np.expand_dims(np.array(len(tokens), ctx=devices[0]), axis=0) + encoded_X, _, _ = net(token_ids, segments, valid_len) + return encoded_X +``` + +```{.python .input} +#@tab pytorch +def get_bert_encoding(net, tokens_a, tokens_b=None): + tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b) + token_ids = torch.tensor(vocab[tokens], device=devices[0]).unsqueeze(0) + segments = torch.tensor(segments, device=devices[0]).unsqueeze(0) + valid_len = torch.tensor(len(tokens), device=devices[0]).unsqueeze(0) + encoded_X, _, _ = net(token_ids, segments, valid_len) + return encoded_X +``` + +考虑一下 “起重机在飞” 这句话。回想一下 :numref:`subsec_bert_input_rep` 中讨论的 BERT 的输入表示形式。插入特殊标记 “”(用于分类)和 “”(用于分隔)后,BERT 输入序列的长度为 6。由于零是 “” 令牌的索引,所以 `encoded_text[:, 0, :]` 是整个输入句子的 BERT 表示。为了评估 polysemy 令牌 “鹤”,我们还打印了代币 BERT 表示的前三个元素。 + +```{.python .input} +#@tab all +tokens_a = ['a', 'crane', 'is', 'flying'] +encoded_text = get_bert_encoding(net, tokens_a) +# Tokens: '', 'a', 'crane', 'is', 'flying', '' +encoded_text_cls = encoded_text[:, 0, :] +encoded_text_crane = encoded_text[:, 2, :] +encoded_text.shape, encoded_text_cls.shape, encoded_text_crane[0][:3] +``` + +现在考虑一对句子 “起重机司机来了” 和 “他刚离开”。同样,`encoded_pair[:, 0, :]` 是预训练的 BERT 整个句子对的编码结果。请注意,polysemy 令牌 “鹤” 的前三个元素与上下文不同时的前三个元素不同。这支持 BERT 表示是上下文相关的。 + +```{.python .input} +#@tab all +tokens_a, tokens_b = ['a', 'crane', 'driver', 'came'], ['he', 'just', 'left'] +encoded_pair = get_bert_encoding(net, tokens_a, tokens_b) +# Tokens: '', 'a', 'crane', 'driver', 'came', '', 'he', 'just', +# 'left', '' +encoded_pair_cls = encoded_pair[:, 0, :] +encoded_pair_crane = encoded_pair[:, 2, :] +encoded_pair.shape, encoded_pair_cls.shape, encoded_pair_crane[0][:3] +``` + +在 :numref:`chap_nlp_app` 中,我们将为下游自然语言处理应用程序微调预训练的 BERT 模型。 + +## 摘要 + +* 原来的 BERT 有两个版本,其中基本模型有 1.1 亿个参数,而大型模型有 3.4 亿个参数。 +* 在预训练 BERT 
之后,我们可以用它来表示单个文本、文本对或其中的任何标记。 +* 在实验中,当上下文不同时,同一个令牌具有不同的 BERT 表示形式。这支持 BERT 表示是上下文相关的。 + +## 练习 + +1. 在实验中,我们可以看到,蒙版语言建模损失明显高于下一个句子预测损失。为什么? +2. 将 BERT 输入序列的最大长度设置为 512(与原始 BERT 模型相同)。使用原始 BERT 模型的配置,例如 $\text{BERT}_{\text{LARGE}}$。运行此部分时你会遇到任何错误吗?为什么? + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/390) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1497) +:end_tab: diff --git a/chapter_natural-language-processing-pretraining/bert-pretraining_origin.md b/chapter_natural-language-processing-pretraining/bert-pretraining_origin.md new file mode 100644 index 000000000..3acae1af1 --- /dev/null +++ b/chapter_natural-language-processing-pretraining/bert-pretraining_origin.md @@ -0,0 +1,313 @@ +# Pretraining BERT +:label:`sec_bert-pretraining` + +With the BERT model implemented in :numref:`sec_bert` +and the pretraining examples generated from the WikiText-2 dataset in :numref:`sec_bert-dataset`, we will pretrain BERT on the WikiText-2 dataset in this section. + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import autograd, gluon, init, np, npx + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +``` + +To start, we load the WikiText-2 dataset as minibatches +of pretraining examples for masked language modeling and next sentence prediction. +The batch size is 512 and the maximum length of a BERT input sequence is 64. +Note that in the original BERT model, the maximum length is 512. + +```{.python .input} +#@tab all +batch_size, max_len = 512, 64 +train_iter, vocab = d2l.load_data_wiki(batch_size, max_len) +``` + +## Pretraining BERT + +The original BERT has two versions of different model sizes :cite:`Devlin.Chang.Lee.ea.2018`. +The base model ($\text{BERT}_{\text{BASE}}$) uses 12 layers (transformer encoder blocks) +with 768 hidden units (hidden size) and 12 self-attention heads. +The large model ($\text{BERT}_{\text{LARGE}}$) uses 24 layers +with 1024 hidden units and 16 self-attention heads. +Notably, the former has 110 million parameters while the latter has 340 million parameters. +For demonstration with ease, +we define a small BERT, using 2 layers, 128 hidden units, and 2 self-attention heads. + +```{.python .input} +net = d2l.BERTModel(len(vocab), num_hiddens=128, ffn_num_hiddens=256, + num_heads=2, num_layers=2, dropout=0.2) +devices = d2l.try_all_gpus() +net.initialize(init.Xavier(), ctx=devices) +loss = gluon.loss.SoftmaxCELoss() +``` + +```{.python .input} +#@tab pytorch +net = d2l.BERTModel(len(vocab), num_hiddens=128, norm_shape=[128], + ffn_num_input=128, ffn_num_hiddens=256, num_heads=2, + num_layers=2, dropout=0.2, key_size=128, query_size=128, + value_size=128, hid_in_features=128, mlm_in_features=128, + nsp_in_features=128) +devices = d2l.try_all_gpus() +loss = nn.CrossEntropyLoss() +``` + +Before defining the training loop, +we define a helper function `_get_batch_loss_bert`. +Given the shard of training examples, +this function computes the loss for both the masked language modeling and next sentence prediction tasks. +Note that the final loss of BERT pretraining +is just the sum of both the masked language modeling loss +and the next sentence prediction loss. 
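In equation form (the notation is introduced here only for clarity), with $\ell_{\text{MLM}}$ denoting the masked language modeling loss and $\ell_{\text{NSP}}$ the next sentence prediction loss, the function below returns

$$\ell = \ell_{\text{MLM}} + \ell_{\text{NSP}}.$$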
+ +```{.python .input} +#@save +def _get_batch_loss_bert(net, loss, vocab_size, tokens_X_shards, + segments_X_shards, valid_lens_x_shards, + pred_positions_X_shards, mlm_weights_X_shards, + mlm_Y_shards, nsp_y_shards): + mlm_ls, nsp_ls, ls = [], [], [] + for (tokens_X_shard, segments_X_shard, valid_lens_x_shard, + pred_positions_X_shard, mlm_weights_X_shard, mlm_Y_shard, + nsp_y_shard) in zip( + tokens_X_shards, segments_X_shards, valid_lens_x_shards, + pred_positions_X_shards, mlm_weights_X_shards, mlm_Y_shards, + nsp_y_shards): + # Forward pass + _, mlm_Y_hat, nsp_Y_hat = net( + tokens_X_shard, segments_X_shard, valid_lens_x_shard.reshape(-1), + pred_positions_X_shard) + # Compute masked language model loss + mlm_l = loss( + mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y_shard.reshape(-1), + mlm_weights_X_shard.reshape((-1, 1))) + mlm_l = mlm_l.sum() / (mlm_weights_X_shard.sum() + 1e-8) + # Compute next sentence prediction loss + nsp_l = loss(nsp_Y_hat, nsp_y_shard) + nsp_l = nsp_l.mean() + mlm_ls.append(mlm_l) + nsp_ls.append(nsp_l) + ls.append(mlm_l + nsp_l) + npx.waitall() + return mlm_ls, nsp_ls, ls +``` + +```{.python .input} +#@tab pytorch +#@save +def _get_batch_loss_bert(net, loss, vocab_size, tokens_X, + segments_X, valid_lens_x, + pred_positions_X, mlm_weights_X, + mlm_Y, nsp_y): + # Forward pass + _, mlm_Y_hat, nsp_Y_hat = net(tokens_X, segments_X, + valid_lens_x.reshape(-1), + pred_positions_X) + # Compute masked language model loss + mlm_l = loss(mlm_Y_hat.reshape(-1, vocab_size), mlm_Y.reshape(-1)) *\ + mlm_weights_X.reshape(-1, 1) + mlm_l = mlm_l.sum() / (mlm_weights_X.sum() + 1e-8) + # Compute next sentence prediction loss + nsp_l = loss(nsp_Y_hat, nsp_y) + l = mlm_l + nsp_l + return mlm_l, nsp_l, l +``` + +Invoking the two aforementioned helper functions, +the following `train_bert` function +defines the procedure to pretrain BERT (`net`) on the WikiText-2 (`train_iter`) dataset. +Training BERT can take very long. +Instead of specifying the number of epochs for training +as in the `train_ch13` function (see :numref:`sec_image_augmentation`), +the input `num_steps` of the following function +specifies the number of iteration steps for training. + +```{.python .input} +def train_bert(train_iter, net, loss, vocab_size, devices, num_steps): + trainer = gluon.Trainer(net.collect_params(), 'adam', + {'learning_rate': 1e-3}) + step, timer = 0, d2l.Timer() + animator = d2l.Animator(xlabel='step', ylabel='loss', + xlim=[1, num_steps], legend=['mlm', 'nsp']) + # Sum of masked language modeling losses, sum of next sentence prediction + # losses, no. 
of sentence pairs, count + metric = d2l.Accumulator(4) + num_steps_reached = False + while step < num_steps and not num_steps_reached: + for batch in train_iter: + (tokens_X_shards, segments_X_shards, valid_lens_x_shards, + pred_positions_X_shards, mlm_weights_X_shards, + mlm_Y_shards, nsp_y_shards) = [gluon.utils.split_and_load( + elem, devices, even_split=False) for elem in batch] + timer.start() + with autograd.record(): + mlm_ls, nsp_ls, ls = _get_batch_loss_bert( + net, loss, vocab_size, tokens_X_shards, segments_X_shards, + valid_lens_x_shards, pred_positions_X_shards, + mlm_weights_X_shards, mlm_Y_shards, nsp_y_shards) + for l in ls: + l.backward() + trainer.step(1) + mlm_l_mean = sum([float(l) for l in mlm_ls]) / len(mlm_ls) + nsp_l_mean = sum([float(l) for l in nsp_ls]) / len(nsp_ls) + metric.add(mlm_l_mean, nsp_l_mean, batch[0].shape[0], 1) + timer.stop() + animator.add(step + 1, + (metric[0] / metric[3], metric[1] / metric[3])) + step += 1 + if step == num_steps: + num_steps_reached = True + break + + print(f'MLM loss {metric[0] / metric[3]:.3f}, ' + f'NSP loss {metric[1] / metric[3]:.3f}') + print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on ' + f'{str(devices)}') +``` + +```{.python .input} +#@tab pytorch +def train_bert(train_iter, net, loss, vocab_size, devices, num_steps): + net = nn.DataParallel(net, device_ids=devices).to(devices[0]) + trainer = torch.optim.Adam(net.parameters(), lr=1e-3) + step, timer = 0, d2l.Timer() + animator = d2l.Animator(xlabel='step', ylabel='loss', + xlim=[1, num_steps], legend=['mlm', 'nsp']) + # Sum of masked language modeling losses, sum of next sentence prediction + # losses, no. of sentence pairs, count + metric = d2l.Accumulator(4) + num_steps_reached = False + while step < num_steps and not num_steps_reached: + for tokens_X, segments_X, valid_lens_x, pred_positions_X,\ + mlm_weights_X, mlm_Y, nsp_y in train_iter: + tokens_X = tokens_X.to(devices[0]) + segments_X = segments_X.to(devices[0]) + valid_lens_x = valid_lens_x.to(devices[0]) + pred_positions_X = pred_positions_X.to(devices[0]) + mlm_weights_X = mlm_weights_X.to(devices[0]) + mlm_Y, nsp_y = mlm_Y.to(devices[0]), nsp_y.to(devices[0]) + trainer.zero_grad() + timer.start() + mlm_l, nsp_l, l = _get_batch_loss_bert( + net, loss, vocab_size, tokens_X, segments_X, valid_lens_x, + pred_positions_X, mlm_weights_X, mlm_Y, nsp_y) + l.backward() + trainer.step() + metric.add(mlm_l, nsp_l, tokens_X.shape[0], 1) + timer.stop() + animator.add(step + 1, + (metric[0] / metric[3], metric[1] / metric[3])) + step += 1 + if step == num_steps: + num_steps_reached = True + break + + print(f'MLM loss {metric[0] / metric[3]:.3f}, ' + f'NSP loss {metric[1] / metric[3]:.3f}') + print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on ' + f'{str(devices)}') +``` + +We can plot both the masked language modeling loss and the next sentence prediction loss +during BERT pretraining. + +```{.python .input} +#@tab all +train_bert(train_iter, net, loss, len(vocab), devices, 50) +``` + +## Representing Text with BERT + +After pretraining BERT, +we can use it to represent single text, text pairs, or any token in them. +The following function returns the BERT (`net`) representations for all tokens +in `tokens_a` and `tokens_b`. 
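As a sanity check (shapes worked out by hand for the small BERT configuration pretrained above, whose hidden size `num_hiddens` is 128), encoding a single four-token sentence is expected to produce the following shape:

```python
# Expected behavior of the function defined next (worked out by hand):
# '<cls>' + ['a', 'crane', 'is', 'flying'] + '<sep>' is 6 tokens in total and
# num_hiddens is 128 for the small BERT pretrained above, hence
# get_bert_encoding(net, ['a', 'crane', 'is', 'flying']).shape == (1, 6, 128)
```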
+ +```{.python .input} +def get_bert_encoding(net, tokens_a, tokens_b=None): + tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b) + token_ids = np.expand_dims(np.array(vocab[tokens], ctx=devices[0]), + axis=0) + segments = np.expand_dims(np.array(segments, ctx=devices[0]), axis=0) + valid_len = np.expand_dims(np.array(len(tokens), ctx=devices[0]), axis=0) + encoded_X, _, _ = net(token_ids, segments, valid_len) + return encoded_X +``` + +```{.python .input} +#@tab pytorch +def get_bert_encoding(net, tokens_a, tokens_b=None): + tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b) + token_ids = torch.tensor(vocab[tokens], device=devices[0]).unsqueeze(0) + segments = torch.tensor(segments, device=devices[0]).unsqueeze(0) + valid_len = torch.tensor(len(tokens), device=devices[0]).unsqueeze(0) + encoded_X, _, _ = net(token_ids, segments, valid_len) + return encoded_X +``` + +Consider the sentence "a crane is flying". +Recall the input representation of BERT as discussed in :numref:`subsec_bert_input_rep`. +After inserting special tokens “<cls>” (used for classification) +and “<sep>” (used for separation), +the BERT input sequence has a length of six. +Since zero is the index of the “<cls>” token, +`encoded_text[:, 0, :]` is the BERT representation of the entire input sentence. +To evaluate the polysemy token "crane", +we also print out the first three elements of the BERT representation of the token. + +```{.python .input} +#@tab all +tokens_a = ['a', 'crane', 'is', 'flying'] +encoded_text = get_bert_encoding(net, tokens_a) +# Tokens: '', 'a', 'crane', 'is', 'flying', '' +encoded_text_cls = encoded_text[:, 0, :] +encoded_text_crane = encoded_text[:, 2, :] +encoded_text.shape, encoded_text_cls.shape, encoded_text_crane[0][:3] +``` + +Now consider a sentence pair +"a crane driver came" and "he just left". +Similarly, `encoded_pair[:, 0, :]` is the encoded result of the entire sentence pair from the pretrained BERT. +Note that the first three elements of the polysemy token "crane" are different from those when the context is different. +This supports that BERT representations are context-sensitive. + +```{.python .input} +#@tab all +tokens_a, tokens_b = ['a', 'crane', 'driver', 'came'], ['he', 'just', 'left'] +encoded_pair = get_bert_encoding(net, tokens_a, tokens_b) +# Tokens: '', 'a', 'crane', 'driver', 'came', '', 'he', 'just', +# 'left', '' +encoded_pair_cls = encoded_pair[:, 0, :] +encoded_pair_crane = encoded_pair[:, 2, :] +encoded_pair.shape, encoded_pair_cls.shape, encoded_pair_crane[0][:3] +``` + +In :numref:`chap_nlp_app`, we will fine-tune a pretrained BERT model +for downstream natural language processing applications. + + +## Summary + +* The original BERT has two versions, where the base model has 110 million parameters and the large model has 340 million parameters. +* After pretraining BERT, we can use it to represent single text, text pairs, or any token in them. +* In the experiment, the same token has different BERT representation when their contexts are different. This supports that BERT representations are context-sensitive. + +## Exercises + +1. In the experiment, we can see that the masked language modeling loss is significantly higher than the next sentence prediction loss. Why? +2. Set the maximum length of a BERT input sequence to be 512 (same as the original BERT model). Use the configurations of the original BERT model such as $\text{BERT}_{\text{LARGE}}$. Do you encounter any error when running this section? Why? 
+ +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/390) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1497) +:end_tab: diff --git a/chapter_natural-language-processing-pretraining/bert.md b/chapter_natural-language-processing-pretraining/bert.md new file mode 100644 index 000000000..27b7374b1 --- /dev/null +++ b/chapter_natural-language-processing-pretraining/bert.md @@ -0,0 +1,430 @@ +# 来自变形金刚(BERT)的双向编码器表示 +:label:`sec_bert` + +我们引入了几个单词嵌入模型来理解自然语言。训练前后,可以将输出视为矩阵,其中每一行都是表示预定义词汇的一个单词的矢量。事实上,这些单词嵌入模型都是 * 上下文无关 *。让我们首先说明这个财产。 + +## 从上下文独立到上下文敏感 + +回想一下 :numref:`sec_word2vec_pretraining` 和 :numref:`sec_synonyms` 中的实验。例如,word2vec 和 GLOVE 都将相同的预训练向量分配给同一个单词,而不管单词的上下文如何(如果有)。从形式上来说,任何令牌 $x$ 的上下文无关表示是一个函数 $f(x)$,它只需要 $x$ 作为输入。鉴于自然语言中的多聚论和复杂语义的丰富性,与上下文无关的表示有明显的局限性。例如,上下文 “起重机正在飞行” 和 “起重机驾驶员来了” 中的 “起重机” 一词具有完全不同的含义;因此,根据上下文,同一个词可能被分配不同的表示形式。 + +这激励了 * 上下文敏感的 * 单词表示形式的开发,其中单词的表示取决于它们的上下文。因此,令牌 $x$ 的上下文相关表示是函数 $f(x, c(x))$,具体取决于 $x$ 及其上下文 $c(x)$。受欢迎的上下文相关表示包括 tagLM(语言模型增强序列标记器):cite:`Peters.Ammar.Bhagavatula.ea.2017`、Cove(上下文向量):cite:`McCann.Bradbury.Xiong.ea.2017` 和 elMO(来自语言模型的嵌入):cite:`Peters.Neumann.Iyyer.ea.2018`。 + +例如,通过将整个序列作为输入,elMO 是一个函数,它为输入序列中的每个单词分配一个表示形式。具体来说,elMO 将预训练的双向 LSTM 中的所有中间图层表示法合并为输出表示法。然后,elMO 表示法将作为附加功能添加到下游任务的现有监督模型中,例如连接现有模型中的 elMO 表示法和原始表示法(例如 GLOVE)。一方面,在添加 elMO 表示之后,预训练的双向 LSTM 模型中的所有权重都会被冻结。另一方面,现有的受监督模型是专门针对给定任务定制的。当时利用不同的最佳模型来处理不同的任务,增加 ELMO 改善了六个自然语言处理任务的最新状态:情绪分析、自然语言推断、语义角色标记、共引解析、命名实体识别和问题回答。 + +## 从特定于任务到不可知的任务 + +尽管 elMO 显著改进了针对各种自然语言处理任务的解决方案,但每个解决方案仍然取决于 * 任务特定的 * 架构。但是,为每个自然语言处理任务设计一个特定的架构实际上并不平凡。GPT(生成预训练)模型代表着为上下文相关表示 :cite:`Radford.Narasimhan.Salimans.ea.2018` 设计一个通用 * 任务无关 * 模型的努力。GPT 建立在变压器解码器之上,预先训练将用于表示文本序列的语言模型。将 GPT 应用于下游任务时,语言模型的输出将被输入添加的线性输出图层,以预测任务的标签。与冻结预训练模型参数的 elMO 形成鲜明对比,GPT 在监督学习下游任务期间对预训练的变压器解码器中的 * 所有参数进行了微调。GPT 在自然语言推断、问答、句子相似性和分类等十二项任务上进行了评估,并在对模型架构的改动最小的情况下改善了其中 9 项任务的最新状态。 + +但是,由于语言模型的自回归性质,GPT 只是向前(从左到右)。在 “我去银行存款现金” 和 “我去银行坐下来” 的情况下,由于 “银行” 对左边的情境敏感,GPT 将返回 “银行” 的相同表述,尽管它有不同的含义。 + +## BERT:结合两全其美 + +正如我们所看到的那样,elMO 以双向方式对上下文进行编码,但使用特定于任务的架构;虽然 GPT 与任务无关,但是对上下文进行了从左到右编码。BERT(来自变形金刚的双向编码器表示)结合了两全其美的结合,对于范围广泛的自然语言处理任务 :cite:`Devlin.Chang.Lee.ea.2018`,对于上下文的双向编码器表示法,只需最少的体系结构更改。使用预训练的变压器编码器,BERT 能够根据其双向上下文表示任何令牌。在监督下游任务学习期间,BERT 在两个方面与 GPT 类似。首先,BERT 表示将被输入添加的输出层,根据任务的性质对模型架构进行最小的更改,例如对每个令牌的预测与整个序列的预测。其次,预训练的变压器编码器的所有参数都经过微调,而额外的输出层将从头开始训练。:numref:`fig_elmo-gpt-bert` 描述了 elMO、GPT 和 BERT 之间的差异。 + +![A comparison of ELMo, GPT, and BERT.](../img/elmo-gpt-bert.svg) +:label:`fig_elmo-gpt-bert` + +BERT 进一步改善了十一项自然语言处理任务的最新状态,这些类别包括:(i) 单一文本分类(例如情绪分析)、(ii)文本对分类(例如自然语言推理)、(iii)问答、(iv)文本标记(例如,指定实体识别)。所有这些都在 2018 年提出,从上下文敏感的 elMO 到与任务无关的 GPT 和 BERT,概念上简单但经验强大的自然语言深度表示预训练,彻底改变了各种自然语言处理任务的解决方案。 + +在本章的其余部分,我们将深入研究 BERT 的预培训。当 :numref:`chap_nlp_app` 中解释自然语言处理应用程序时,我们将说明对下游应用程序的 BERT 的微调。 + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import gluon, np, npx +from mxnet.gluon import nn + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +``` + +## 输入表示法 +:label:`subsec_bert_input_rep` + +在自然语言处理中,某些任务(例如情绪分析)将单个文本作为输入,而在其他一些任务(例如自然语言推断)中,输入是一对文本序列。BERT 输入序列明确表示单个文本对和文本对。在前者中,BERT 输入序列是特殊分类标记 “”、文本序列的标记和特殊分隔令牌 “” 的串联。在后者中,BERT 输入序列是 “”、第一个文本序列的标记 “”、第二个文本序列的标记和 “” 的连接。我们将始终将术语 “BERT 输入序列” 与其他类型的 “序列” 区分开来。例如,一个 *BERT 输入序列 * 可能包含一个 * 文本序列 * 或两个 * 文本序列 *。 + +为了区分文本对,学习的细分嵌入 $\mathbf{e}_A$ 和 $\mathbf{e}_B$ 分别添加到第一个序列和第二序列的令牌嵌入中。对于单个文本输入,只使用 $\mathbf{e}_A$。 + +以下 `get_tokens_and_segments` 以一句或两句话作为输入,然后返回 BERT 输入序列的标记及其对应的段 ID。 
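For instance (a hand-worked example; '<cls>' and '<sep>' stand for the special classification and separator tokens of :numref:`subsec_bert_input_rep`), the function defined next is expected to behave as follows:

```python
# Expected behavior of get_tokens_and_segments (worked out by hand):
tokens_a, tokens_b = ['a', 'crane'], ['is', 'flying']
# tokens:   ['<cls>', 'a', 'crane', '<sep>', 'is', 'flying', '<sep>']
# segments: [0,       0,   0,       0,       1,    1,        1]
```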
+ +```{.python .input} +#@tab all +#@save +def get_tokens_and_segments(tokens_a, tokens_b=None): + """Get tokens of the BERT input sequence and their segment IDs.""" + tokens = [''] + tokens_a + [''] + # 0 and 1 are marking segment A and B, respectively + segments = [0] * (len(tokens_a) + 2) + if tokens_b is not None: + tokens += tokens_b + [''] + segments += [1] * (len(tokens_b) + 1) + return tokens, segments +``` + +BERT 选择变压器编码器作为其双向架构。在变压器编码器中常见,位置嵌入在 BERT 输入序列的每个位置都添加。但是,与原来的变压器编码器不同,BERT 使用 * 可学习 * 位置嵌入。总而言之,:numref:`fig_bert-input` 显示,BERT 输入序列的嵌入是令牌嵌入、区段嵌入和位置嵌入的总和。 + +![BERT 输入序列的嵌入是令牌嵌入、区段嵌入和位置嵌入的总和。](../img/bert-input.svg) :label:`fig_bert-input` + +以下 `BERTEncoder` 类与 :numref:`sec_transformer` 中实施的 `TransformerEncoder` 类类似。与 `TransformerEncoder` 不同,`BERTEncoder` 使用细分嵌入和可学习的位置嵌入。 + +```{.python .input} +#@save +class BERTEncoder(nn.Block): + """BERT encoder.""" + def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads, + num_layers, dropout, max_len=1000, **kwargs): + super(BERTEncoder, self).__init__(**kwargs) + self.token_embedding = nn.Embedding(vocab_size, num_hiddens) + self.segment_embedding = nn.Embedding(2, num_hiddens) + self.blks = nn.Sequential() + for _ in range(num_layers): + self.blks.add(d2l.EncoderBlock( + num_hiddens, ffn_num_hiddens, num_heads, dropout, True)) + # In BERT, positional embeddings are learnable, thus we create a + # parameter of positional embeddings that are long enough + self.pos_embedding = self.params.get('pos_embedding', + shape=(1, max_len, num_hiddens)) + + def forward(self, tokens, segments, valid_lens): + # Shape of `X` remains unchanged in the following code snippet: + # (batch size, max sequence length, `num_hiddens`) + X = self.token_embedding(tokens) + self.segment_embedding(segments) + X = X + self.pos_embedding.data(ctx=X.ctx)[:, :X.shape[1], :] + for blk in self.blks: + X = blk(X, valid_lens) + return X +``` + +```{.python .input} +#@tab pytorch +#@save +class BERTEncoder(nn.Module): + """BERT encoder.""" + def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input, + ffn_num_hiddens, num_heads, num_layers, dropout, + max_len=1000, key_size=768, query_size=768, value_size=768, + **kwargs): + super(BERTEncoder, self).__init__(**kwargs) + self.token_embedding = nn.Embedding(vocab_size, num_hiddens) + self.segment_embedding = nn.Embedding(2, num_hiddens) + self.blks = nn.Sequential() + for i in range(num_layers): + self.blks.add_module(f"{i}", d2l.EncoderBlock( + key_size, query_size, value_size, num_hiddens, norm_shape, + ffn_num_input, ffn_num_hiddens, num_heads, dropout, True)) + # In BERT, positional embeddings are learnable, thus we create a + # parameter of positional embeddings that are long enough + self.pos_embedding = nn.Parameter(torch.randn(1, max_len, + num_hiddens)) + + def forward(self, tokens, segments, valid_lens): + # Shape of `X` remains unchanged in the following code snippet: + # (batch size, max sequence length, `num_hiddens`) + X = self.token_embedding(tokens) + self.segment_embedding(segments) + X = X + self.pos_embedding.data[:, :X.shape[1], :] + for blk in self.blks: + X = blk(X, valid_lens) + return X +``` + +假设词汇量大小是 10000。为了演示 `BERTEncoder` 的前向推理,让我们创建它的实例并初始化其参数。 + +```{.python .input} +vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4 +num_layers, dropout = 2, 0.2 +encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, num_heads, + num_layers, dropout) +encoder.initialize() +``` + +```{.python .input} +#@tab pytorch +vocab_size, num_hiddens, 
ffn_num_hiddens, num_heads = 10000, 768, 1024, 4 +norm_shape, ffn_num_input, num_layers, dropout = [768], 768, 2, 0.2 +encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape, ffn_num_input, + ffn_num_hiddens, num_heads, num_layers, dropout) +``` + +我们将 `tokens` 定义为 2 个长度为 8 的 BERT 输入序列,其中每个标记都是词汇的索引。输入 `BERTEncoder` 的 `BERTEncoder` 和输入 `tokens` 返回编码结果,其中每个令牌由超参数 `num_hiddens` 预定义的向量表示,其长度由超参数 `num_hiddens` 预定义。此超参数通常称为变压器编码器的 * 隐藏大小 *(隐藏单位数)。 + +```{.python .input} +tokens = np.random.randint(0, vocab_size, (2, 8)) +segments = np.array([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]]) +encoded_X = encoder(tokens, segments, None) +encoded_X.shape +``` + +```{.python .input} +#@tab pytorch +tokens = torch.randint(0, vocab_size, (2, 8)) +segments = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]]) +encoded_X = encoder(tokens, segments, None) +encoded_X.shape +``` + +## 培训前任务 +:label:`subsec_bert_pretraining_tasks` + +`BERTEncoder` 的前向推断给出了输入文本的每个令牌的 BERT 表示以及插入的特殊标记 “” 和 “”。接下来,我们将使用这些表示法来计算预训练 BERT 的损失函数。预培训由以下两项任务组成:蒙版语言建模和下一句话预测。 + +### 蒙面语言建模 +:label:`subsec_mlm` + +如 :numref:`sec_language_model` 所示,语言模型使用左侧的上下文来预测令牌。为了对上下文进行双向编码以表示每个令牌,BERT 会随机掩盖令牌,并使用双向上下文中的令牌以自我监督的方式预测被掩码的令牌。此任务被称为 * 蒙面语言模型 *。 + +在此预训任务中,15% 的代币将随机选择作为预测的蒙面代币。要在不使用标签作弊的情况下预测蒙面的令牌,一种简单的方法是始终在 BERT 输入序列中用特殊的 “” 令牌替换它。但是,人为的特殊令牌 “” 永远不会出现在微调中。为避免预训和微调之间的这种不匹配,如果标记被掩盖进行预测(例如,在 “这部电影很棒” 中选择了 “很棒” 来掩盖和预测),则在输入内容中将替换为: + +* 80% 的时间里,一个特殊的 “” 令牌(例如,“这部电影很棒” 变成 “这部电影是”); +* 10% 的时间内随机令牌(例如,“这部电影很棒” 变成 “这部电影很喝”); +* 10% 的时间内不变的标签令牌(例如,“这部电影很棒” 变成 “这部电影很棒”)。 + +请注意,在 15% 的时间里,插入随机令牌的 10%。这种偶尔的噪音鼓励 BERT 在双向上下文编码中减少对蒙面令牌的偏见(特别是当标签令牌保持不变时)。 + +我们实施了以下 `MaskLM` 课程来预测 BERT 预训的蒙面语言模型任务中的蒙面令牌。该预测使用一个隐藏层 MLP(`self.mlp`)。在前向推断中,它需要两个输入:`BERTEncoder` 的编码结果和用于预测的代币位置。输出是这些仓位的预测结果。 + +```{.python .input} +#@save +class MaskLM(nn.Block): + """The masked language model task of BERT.""" + def __init__(self, vocab_size, num_hiddens, **kwargs): + super(MaskLM, self).__init__(**kwargs) + self.mlp = nn.Sequential() + self.mlp.add( + nn.Dense(num_hiddens, flatten=False, activation='relu')) + self.mlp.add(nn.LayerNorm()) + self.mlp.add(nn.Dense(vocab_size, flatten=False)) + + def forward(self, X, pred_positions): + num_pred_positions = pred_positions.shape[1] + pred_positions = pred_positions.reshape(-1) + batch_size = X.shape[0] + batch_idx = np.arange(0, batch_size) + # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then + # `batch_idx` is `np.array([0, 0, 0, 1, 1, 1])` + batch_idx = np.repeat(batch_idx, num_pred_positions) + masked_X = X[batch_idx, pred_positions] + masked_X = masked_X.reshape((batch_size, num_pred_positions, -1)) + mlm_Y_hat = self.mlp(masked_X) + return mlm_Y_hat +``` + +```{.python .input} +#@tab pytorch +#@save +class MaskLM(nn.Module): + """The masked language model task of BERT.""" + def __init__(self, vocab_size, num_hiddens, num_inputs=768, **kwargs): + super(MaskLM, self).__init__(**kwargs) + self.mlp = nn.Sequential(nn.Linear(num_inputs, num_hiddens), + nn.ReLU(), + nn.LayerNorm(num_hiddens), + nn.Linear(num_hiddens, vocab_size)) + + def forward(self, X, pred_positions): + num_pred_positions = pred_positions.shape[1] + pred_positions = pred_positions.reshape(-1) + batch_size = X.shape[0] + batch_idx = torch.arange(0, batch_size) + # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then + # `batch_idx` is `torch.tensor([0, 0, 0, 1, 1, 1])` + batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions) + masked_X = X[batch_idx, pred_positions] + masked_X = 
masked_X.reshape((batch_size, num_pred_positions, -1)) + mlm_Y_hat = self.mlp(masked_X) + return mlm_Y_hat +``` + +为了演示 `MaskLM` 的前向推断,我们创建了它的实例 `mlm` 并对其进行初始化。回想一下,`encoded_X` 从前向推断 `BERTEncoder` 代表 2 个 BERT 输入序列。我们将 `mlm_positions` 定义为在 `encoded_X` 的 BERT 输入序列中要预测的 3 个指数。`mlm` 的前瞻推断回报预测结果为 `mlm_Y_hat`,在 `encoded_X` 的所有蒙面仓位 `mlm_positions`。对于每个预测,结果的大小等于词汇量大小。 + +```{.python .input} +mlm = MaskLM(vocab_size, num_hiddens) +mlm.initialize() +mlm_positions = np.array([[1, 5, 2], [6, 1, 5]]) +mlm_Y_hat = mlm(encoded_X, mlm_positions) +mlm_Y_hat.shape +``` + +```{.python .input} +#@tab pytorch +mlm = MaskLM(vocab_size, num_hiddens) +mlm_positions = torch.tensor([[1, 5, 2], [6, 1, 5]]) +mlm_Y_hat = mlm(encoded_X, mlm_positions) +mlm_Y_hat.shape +``` + +通过掩码下的预测令牌 `mlm_Y_hat` 的地面真相标签 `mlm_Y`,我们可以计算 BERT 预训练中蒙面语言模型任务的交叉熵损失。 + +```{.python .input} +mlm_Y = np.array([[7, 8, 9], [10, 20, 30]]) +loss = gluon.loss.SoftmaxCrossEntropyLoss() +mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1)) +mlm_l.shape +``` + +```{.python .input} +#@tab pytorch +mlm_Y = torch.tensor([[7, 8, 9], [10, 20, 30]]) +loss = nn.CrossEntropyLoss(reduction='none') +mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1)) +mlm_l.shape +``` + +### 下一句预测 +:label:`subsec_nsp` + +虽然蒙版语言建模能够对表示单词的双向上下文进行编码,但它并没有明确建模文本对之间的逻辑关系。为了帮助理解两个文本序列之间的关系,BERT 在其预训中考虑了二进制分类任务 * 下一句预测 *。在为预训生成句子对时,有一半时间它们确实是带有 “True” 标签的连续句子;而另一半时间,第二个句子是从标有 “False” 标签的语料库中随机抽取的。 + +接下来的 `NextSentencePred` 类使用一个隐藏层 MLP 来预测第二句是否是 BERT 输入序列中第一句的下一句。由于变压器编码器中的自我注意力,特殊令牌 “” 的 BERT 表示对输入的两个句子进行了编码。因此,MLP 分类器的输出层 (`self.output`) 采用 `X` 作为输入,其中 `X` 是 MLP 隐藏层的输出,其输入是编码的 “” 令牌。 + +```{.python .input} +#@save +class NextSentencePred(nn.Block): + """The next sentence prediction task of BERT.""" + def __init__(self, **kwargs): + super(NextSentencePred, self).__init__(**kwargs) + self.output = nn.Dense(2) + + def forward(self, X): + # `X` shape: (batch size, `num_hiddens`) + return self.output(X) +``` + +```{.python .input} +#@tab pytorch +#@save +class NextSentencePred(nn.Module): + """The next sentence prediction task of BERT.""" + def __init__(self, num_inputs, **kwargs): + super(NextSentencePred, self).__init__(**kwargs) + self.output = nn.Linear(num_inputs, 2) + + def forward(self, X): + # `X` shape: (batch size, `num_hiddens`) + return self.output(X) +``` + +我们可以看到,`NextSentencePred` 实例的前向推断返回每个 BERT 输入序列的二进制预测。 + +```{.python .input} +nsp = NextSentencePred() +nsp.initialize() +nsp_Y_hat = nsp(encoded_X) +nsp_Y_hat.shape +``` + +```{.python .input} +#@tab pytorch +# PyTorch by default won't flatten the tensor as seen in mxnet where, if +# flatten=True, all but the first axis of input data are collapsed together +encoded_X = torch.flatten(encoded_X, start_dim=1) +# input_shape for NSP: (batch size, `num_hiddens`) +nsp = NextSentencePred(encoded_X.shape[-1]) +nsp_Y_hat = nsp(encoded_X) +nsp_Y_hat.shape +``` + +还可以计算两个二进制分类的交叉熵损失。 + +```{.python .input} +nsp_y = np.array([0, 1]) +nsp_l = loss(nsp_Y_hat, nsp_y) +nsp_l.shape +``` + +```{.python .input} +#@tab pytorch +nsp_y = torch.tensor([0, 1]) +nsp_l = loss(nsp_Y_hat, nsp_y) +nsp_l.shape +``` + +值得注意的是,上述两项预培训任务中的所有标签都可以在没有人工标签的情况下从培训前语料库中轻而易举地获得。原来的 BERT 已经在 Bookcorpus :cite:`Zhu.Kiros.Zemel.ea.2015` 和英语维基百科的连接方面进行了预培训。这两个文本语句是巨大的:它们分别有 8 亿个单词和 25 亿个单词。 + +## 把所有东西放在一起 + +在预训练 BERT 时,最终损失函数是掩码语言建模的损失函数和下一句预测的线性组合。现在我们可以通过实例化三个类 `BERTEncoder`、`MaskLM` 和 `NextSentencePred` 来定义 `BERTModel` 类。前向推理返回编码的 BERT 表示 `encoded_X`、对蒙面语言建模 `mlm_Y_hat` 的预测以及下一句预测 `nsp_Y_hat`。 + +```{.python .input} 
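+# `BERTModel` combines `BERTEncoder`, `MaskLM`, and `NextSentencePred`: the
+# encoder output feeds the masked language model head at `pred_positions`
+# (when provided), while the representation of the '<cls>' token (index 0)
+# passes through a tanh fully-connected layer before the next sentence
+# prediction head. The two task losses are combined linearly outside this
+# class during pretraining.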
+#@save +class BERTModel(nn.Block): + """The BERT model.""" + def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads, + num_layers, dropout, max_len=1000): + super(BERTModel, self).__init__() + self.encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, + num_heads, num_layers, dropout, max_len) + self.hidden = nn.Dense(num_hiddens, activation='tanh') + self.mlm = MaskLM(vocab_size, num_hiddens) + self.nsp = NextSentencePred() + + def forward(self, tokens, segments, valid_lens=None, pred_positions=None): + encoded_X = self.encoder(tokens, segments, valid_lens) + if pred_positions is not None: + mlm_Y_hat = self.mlm(encoded_X, pred_positions) + else: + mlm_Y_hat = None + # The hidden layer of the MLP classifier for next sentence prediction. + # 0 is the index of the '' token + nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :])) + return encoded_X, mlm_Y_hat, nsp_Y_hat +``` + +```{.python .input} +#@tab pytorch +#@save +class BERTModel(nn.Module): + """The BERT model.""" + def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input, + ffn_num_hiddens, num_heads, num_layers, dropout, + max_len=1000, key_size=768, query_size=768, value_size=768, + hid_in_features=768, mlm_in_features=768, + nsp_in_features=768): + super(BERTModel, self).__init__() + self.encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape, + ffn_num_input, ffn_num_hiddens, num_heads, num_layers, + dropout, max_len=max_len, key_size=key_size, + query_size=query_size, value_size=value_size) + self.hidden = nn.Sequential(nn.Linear(hid_in_features, num_hiddens), + nn.Tanh()) + self.mlm = MaskLM(vocab_size, num_hiddens, mlm_in_features) + self.nsp = NextSentencePred(nsp_in_features) + + def forward(self, tokens, segments, valid_lens=None, pred_positions=None): + encoded_X = self.encoder(tokens, segments, valid_lens) + if pred_positions is not None: + mlm_Y_hat = self.mlm(encoded_X, pred_positions) + else: + mlm_Y_hat = None + # The hidden layer of the MLP classifier for next sentence prediction. + # 0 is the index of the '' token + nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :])) + return encoded_X, mlm_Y_hat, nsp_Y_hat +``` + +## 摘要 + +* Word2vec 和 Glove 等单词嵌入模型与上下文无关。无论单词的上下文如何(如果有),它们都会将相同的预训练向量分配给同一个单词。他们很难以很好地处理自然语言中的多聚结或复杂的语义。 +* 对于上下文相关的单词表示(例如 elMO 和 GPT),单词的表示取决于它们的上下文。 +* elMO 以双向方式对上下文进行编码,但使用特定于任务的架构(但是,为每个自然语言处理任务设计一个特定的架构实际上并不平凡);而 GPT 与任务无关,但是从左到右编码上下文。 +* BERT 结合了两全其美:它以双向方式编码上下文,对于各种自然语言处理任务,只需最少的体系结构更改。 +* BERT 输入序列的嵌入是令牌嵌入、区段嵌入和位置嵌入的总和。 +* 培训前 BERT 由两项任务组成:蒙面语言建模和下一句话预测。前者能够对表示单词的双向上下文进行编码,而后者则明确建模文本对之间的逻辑关系。 + +## 练习 + +1. 为什么 BERT 会成功? +1. 所有其他事情都相同,蒙面语言模型是否需要比从左到右语言模型需要更多或更少的预训步骤才能收敛?为什么? +1. 在 BERT 的最初实现中,`BERTEncoder`(通过 `d2l.EncoderBlock`)中的定位前馈网络和 `MaskLM` 中的完全连接层都使用高斯误差线性单元 (GELU) :cite:`Hendrycks.Gimpel.2016` 作为激活函数。研究 GELU 和 RELU 之间的区别。 + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/388) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1490) +:end_tab: diff --git a/chapter_natural-language-processing-pretraining/bert_origin.md b/chapter_natural-language-processing-pretraining/bert_origin.md new file mode 100644 index 000000000..d9c37c784 --- /dev/null +++ b/chapter_natural-language-processing-pretraining/bert_origin.md @@ -0,0 +1,614 @@ +# Bidirectional Encoder Representations from Transformers (BERT) +:label:`sec_bert` + +We have introduced several word embedding models for natural language understanding. 
+After pretraining, the output can be thought of as a matrix +where each row is a vector that represents a word of a predefined vocabulary. +In fact, these word embedding models are all *context-independent*. +Let us begin by illustrating this property. + + +## From Context-Independent to Context-Sensitive + +Recall the experiments in :numref:`sec_word2vec_pretraining` and :numref:`sec_synonyms`. +For instance, word2vec and GloVe both assign the same pretrained vector to the same word regardless of the context of the word (if any). +Formally, a context-independent representation of any token $x$ +is a function $f(x)$ that only takes $x$ as its input. +Given the abundance of polysemy and complex semantics in natural languages, +context-independent representations have obvious limitations. +For instance, the word "crane" in contexts +"a crane is flying" and "a crane driver came" has completely different meanings; +thus, the same word may be assigned different representations depending on contexts. + +This motivates the development of *context-sensitive* word representations, +where representations of words depend on their contexts. +Hence, a context-sensitive representation of token $x$ is a function $f(x, c(x))$ +depending on both $x$ and its context $c(x)$. +Popular context-sensitive representations +include TagLM (language-model-augmented sequence tagger) :cite:`Peters.Ammar.Bhagavatula.ea.2017`, +CoVe (Context Vectors) :cite:`McCann.Bradbury.Xiong.ea.2017`, +and ELMo (Embeddings from Language Models) :cite:`Peters.Neumann.Iyyer.ea.2018`. + +For example, by taking the entire sequence as the input, +ELMo is a function that assigns a representation to each word from the input sequence. +Specifically, ELMo combines all the intermediate layer representations from pretrained bidirectional LSTM as the output representation. +Then the ELMo representation will be added to a downstream task's existing supervised model +as additional features, such as by concatenating ELMo representation and the original representation (e.g., GloVe) of tokens in the existing model. +On one hand, +all the weights in the pretrained bidirectional LSTM model are frozen after ELMo representations are added. +On the other hand, +the existing supervised model is specifically customized for a given task. +Leveraging different best models for different tasks at that time, +adding ELMo improved the state of the art across six natural language processing tasks: +sentiment analysis, natural language inference, +semantic role labeling, coreference resolution, +named entity recognition, and question answering. + + +## From Task-Specific to Task-Agnostic + +Although ELMo has significantly improved solutions to a diverse set of natural language processing tasks, +each solution still hinges on a *task-specific* architecture. +However, it is practically non-trivial to craft a specific architecture for every natural language processing task. +The GPT (Generative Pre-Training) model represents an effort in designing +a general *task-agnostic* model for context-sensitive representations :cite:`Radford.Narasimhan.Salimans.ea.2018`. +Built on a transformer decoder, +GPT pretrains a language model that will be used to represent text sequences. +When applying GPT to a downstream task, +the output of the language model will be fed into an added linear output layer +to predict the label of the task. 
+In sharp contrast to ELMo that freezes parameters of the pretrained model, +GPT fine-tunes *all* the parameters in the pretrained transformer decoder +during supervised learning of the downstream task. +GPT was evaluated on twelve tasks of natural language inference, +question answering, sentence similarity, and classification, +and improved the state of the art in nine of them with minimal changes +to the model architecture. + +However, due to the autoregressive nature of language models, +GPT only looks forward (left-to-right). +In contexts "i went to the bank to deposit cash" and "i went to the bank to sit down", +as "bank" is sensitive to the context to its left, +GPT will return the same representation for "bank", +though it has different meanings. + + +## BERT: Combining the Best of Both Worlds + +As we have seen, +ELMo encodes context bidirectionally but uses task-specific architectures; +while GPT is task-agnostic but encodes context left-to-right. +Combining the best of both worlds, +BERT (Bidirectional Encoder Representations from Transformers) +encodes context bidirectionally and requires minimal architecture changes +for a wide range of natural language processing tasks :cite:`Devlin.Chang.Lee.ea.2018`. +Using a pretrained transformer encoder, +BERT is able to represent any token based on its bidirectional context. +During supervised learning of downstream tasks, +BERT is similar to GPT in two aspects. +First, BERT representations will be fed into an added output layer, +with minimal changes to the model architecture depending on nature of tasks, +such as predicting for every token vs. predicting for the entire sequence. +Second, +all the parameters of the pretrained transformer encoder are fine-tuned, +while the additional output layer will be trained from scratch. +:numref:`fig_elmo-gpt-bert` depicts the differences among ELMo, GPT, and BERT. + +![A comparison of ELMo, GPT, and BERT.](../img/elmo-gpt-bert.svg) +:label:`fig_elmo-gpt-bert` + + +BERT further improved the state of the art on eleven natural language processing tasks +under broad categories of (i) single text classification (e.g., sentiment analysis), (ii) text pair classification (e.g., natural language inference), +(iii) question answering, (iv) text tagging (e.g., named entity recognition). +All proposed in 2018, +from context-sensitive ELMo to task-agnostic GPT and BERT, +conceptually simple yet empirically powerful pretraining of deep representations for natural languages have revolutionized solutions to various natural language processing tasks. + +In the rest of this chapter, +we will dive into the pretraining of BERT. +When natural language processing applications are explained in :numref:`chap_nlp_app`, +we will illustrate fine-tuning of BERT for downstream applications. + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import gluon, np, npx +from mxnet.gluon import nn + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +``` + +## Input Representation +:label:`subsec_bert_input_rep` + +In natural language processing, +some tasks (e.g., sentiment analysis) take single text as the input, +while in some other tasks (e.g., natural language inference), +the input is a pair of text sequences. +The BERT input sequence unambiguously represents both single text and text pairs. 
+In the former, +the BERT input sequence is the concatenation of +the special classification token “<cls>”, +tokens of a text sequence, +and the special separation token “<sep>”. +In the latter, +the BERT input sequence is the concatenation of +“<cls>”, tokens of the first text sequence, +“<sep>”, tokens of the second text sequence, and “<sep>”. +We will consistently distinguish the terminology "BERT input sequence" +from other types of "sequences". +For instance, one *BERT input sequence* may include either one *text sequence* or two *text sequences*. + +To distinguish text pairs, +the learned segment embeddings $\mathbf{e}_A$ and $\mathbf{e}_B$ +are added to the token embeddings of the first sequence and the second sequence, respectively. +For single text inputs, only $\mathbf{e}_A$ is used. + +The following `get_tokens_and_segments` takes either one sentence or two sentences +as the input, then returns tokens of the BERT input sequence +and their corresponding segment IDs. + +```{.python .input} +#@tab all +#@save +def get_tokens_and_segments(tokens_a, tokens_b=None): + """Get tokens of the BERT input sequence and their segment IDs.""" + tokens = [''] + tokens_a + [''] + # 0 and 1 are marking segment A and B, respectively + segments = [0] * (len(tokens_a) + 2) + if tokens_b is not None: + tokens += tokens_b + [''] + segments += [1] * (len(tokens_b) + 1) + return tokens, segments +``` + +BERT chooses the transformer encoder as its bidirectional architecture. +Common in the transformer encoder, +positional embeddings are added at every position of the BERT input sequence. +However, different from the original transformer encoder, +BERT uses *learnable* positional embeddings. +To sum up, :numref:`fig_bert-input` shows that +the embeddings of the BERT input sequence are the sum +of the token embeddings, segment embeddings, and positional embeddings. + +![The embeddings of the BERT input sequence are the sum +of the token embeddings, segment embeddings, and positional embeddings.](../img/bert-input.svg) +:label:`fig_bert-input` + +The following `BERTEncoder` class is similar to the `TransformerEncoder` class +as implemented in :numref:`sec_transformer`. +Different from `TransformerEncoder`, `BERTEncoder` uses +segment embeddings and learnable positional embeddings. 
+ +```{.python .input} +#@save +class BERTEncoder(nn.Block): + """BERT encoder.""" + def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads, + num_layers, dropout, max_len=1000, **kwargs): + super(BERTEncoder, self).__init__(**kwargs) + self.token_embedding = nn.Embedding(vocab_size, num_hiddens) + self.segment_embedding = nn.Embedding(2, num_hiddens) + self.blks = nn.Sequential() + for _ in range(num_layers): + self.blks.add(d2l.EncoderBlock( + num_hiddens, ffn_num_hiddens, num_heads, dropout, True)) + # In BERT, positional embeddings are learnable, thus we create a + # parameter of positional embeddings that are long enough + self.pos_embedding = self.params.get('pos_embedding', + shape=(1, max_len, num_hiddens)) + + def forward(self, tokens, segments, valid_lens): + # Shape of `X` remains unchanged in the following code snippet: + # (batch size, max sequence length, `num_hiddens`) + X = self.token_embedding(tokens) + self.segment_embedding(segments) + X = X + self.pos_embedding.data(ctx=X.ctx)[:, :X.shape[1], :] + for blk in self.blks: + X = blk(X, valid_lens) + return X +``` + +```{.python .input} +#@tab pytorch +#@save +class BERTEncoder(nn.Module): + """BERT encoder.""" + def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input, + ffn_num_hiddens, num_heads, num_layers, dropout, + max_len=1000, key_size=768, query_size=768, value_size=768, + **kwargs): + super(BERTEncoder, self).__init__(**kwargs) + self.token_embedding = nn.Embedding(vocab_size, num_hiddens) + self.segment_embedding = nn.Embedding(2, num_hiddens) + self.blks = nn.Sequential() + for i in range(num_layers): + self.blks.add_module(f"{i}", d2l.EncoderBlock( + key_size, query_size, value_size, num_hiddens, norm_shape, + ffn_num_input, ffn_num_hiddens, num_heads, dropout, True)) + # In BERT, positional embeddings are learnable, thus we create a + # parameter of positional embeddings that are long enough + self.pos_embedding = nn.Parameter(torch.randn(1, max_len, + num_hiddens)) + + def forward(self, tokens, segments, valid_lens): + # Shape of `X` remains unchanged in the following code snippet: + # (batch size, max sequence length, `num_hiddens`) + X = self.token_embedding(tokens) + self.segment_embedding(segments) + X = X + self.pos_embedding.data[:, :X.shape[1], :] + for blk in self.blks: + X = blk(X, valid_lens) + return X +``` + +Suppose that the vocabulary size is 10000. +To demonstrate forward inference of `BERTEncoder`, +let us create an instance of it and initialize its parameters. + +```{.python .input} +vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4 +num_layers, dropout = 2, 0.2 +encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, num_heads, + num_layers, dropout) +encoder.initialize() +``` + +```{.python .input} +#@tab pytorch +vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4 +norm_shape, ffn_num_input, num_layers, dropout = [768], 768, 2, 0.2 +encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape, ffn_num_input, + ffn_num_hiddens, num_heads, num_layers, dropout) +``` + +We define `tokens` to be 2 BERT input sequences of length 8, +where each token is an index of the vocabulary. +The forward inference of `BERTEncoder` with the input `tokens` +returns the encoded result where each token is represented by a vector +whose length is predefined by the hyperparameter `num_hiddens`. +This hyperparameter is usually referred to as the *hidden size* +(number of hidden units) of the transformer encoder. 
+ +```{.python .input} +tokens = np.random.randint(0, vocab_size, (2, 8)) +segments = np.array([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]]) +encoded_X = encoder(tokens, segments, None) +encoded_X.shape +``` + +```{.python .input} +#@tab pytorch +tokens = torch.randint(0, vocab_size, (2, 8)) +segments = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]]) +encoded_X = encoder(tokens, segments, None) +encoded_X.shape +``` + +## Pretraining Tasks +:label:`subsec_bert_pretraining_tasks` + +The forward inference of `BERTEncoder` gives the BERT representation +of each token of the input text and the inserted +special tokens “<cls>” and “<seq>”. +Next, we will use these representations to compute the loss function +for pretraining BERT. +The pretraining is composed of the following two tasks: +masked language modeling and next sentence prediction. + +### Masked Language Modeling +:label:`subsec_mlm` + +As illustrated in :numref:`sec_language_model`, +a language model predicts a token using the context on its left. +To encode context bidirectionally for representing each token, +BERT randomly masks tokens and uses tokens from the bidirectional context to +predict the masked tokens in a self-supervised fashion. +This task is referred to as a *masked language model*. + +In this pretraining task, +15% of tokens will be selected at random as the masked tokens for prediction. +To predict a masked token without cheating by using the label, +one straightforward approach is to always replace it with a special “<mask>” token in the BERT input sequence. +However, the artificial special token “<mask>” will never appear +in fine-tuning. +To avoid such a mismatch between pretraining and fine-tuning, +if a token is masked for prediction (e.g., "great" is selected to be masked and predicted in "this movie is great"), +in the input it will be replaced with: + +* a special “<mask>” token for 80% of the time (e.g., "this movie is great" becomes "this movie is <mask>"); +* a random token for 10% of the time (e.g., "this movie is great" becomes "this movie is drink"); +* the unchanged label token for 10% of the time (e.g., "this movie is great" becomes "this movie is great"). + +Note that for 10% of 15% time a random token is inserted. +This occasional noise encourages BERT to be less biased towards the masked token (especially when the label token remains unchanged) in its bidirectional context encoding. + +We implement the following `MaskLM` class to predict masked tokens +in the masked language model task of BERT pretraining. +The prediction uses a one-hidden-layer MLP (`self.mlp`). +In forward inference, it takes two inputs: +the encoded result of `BERTEncoder` and the token positions for prediction. +The output is the prediction results at these positions. 
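+
+The only subtle step in the implementation below is gathering the encoded
+vectors at the prediction positions. The following standalone snippet, a
+minimal sketch with made-up shapes and positions, illustrates that
+advanced-indexing pattern; the same pattern appears inside `MaskLM.forward`.
+
+```{.python .input}
+#@tab pytorch
+import torch
+
+batch_size, seq_len, num_hiddens = 2, 8, 4
+# Stand-in for the encoder output: one vector per token of each sequence
+X = torch.arange(batch_size * seq_len * num_hiddens, dtype=torch.float32)
+X = X.reshape(batch_size, seq_len, num_hiddens)
+pred_positions = torch.tensor([[1, 5, 2], [6, 1, 5]])  # positions to predict
+num_pred = pred_positions.shape[1]
+# Pair every flattened position with the index of its sequence in the batch
+batch_idx = torch.repeat_interleave(torch.arange(batch_size), num_pred)
+masked_X = X[batch_idx, pred_positions.reshape(-1)]
+masked_X = masked_X.reshape(batch_size, num_pred, -1)
+masked_X.shape  # torch.Size([2, 3, 4]): one vector per predicted position
+```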
+ +```{.python .input} +#@save +class MaskLM(nn.Block): + """The masked language model task of BERT.""" + def __init__(self, vocab_size, num_hiddens, **kwargs): + super(MaskLM, self).__init__(**kwargs) + self.mlp = nn.Sequential() + self.mlp.add( + nn.Dense(num_hiddens, flatten=False, activation='relu')) + self.mlp.add(nn.LayerNorm()) + self.mlp.add(nn.Dense(vocab_size, flatten=False)) + + def forward(self, X, pred_positions): + num_pred_positions = pred_positions.shape[1] + pred_positions = pred_positions.reshape(-1) + batch_size = X.shape[0] + batch_idx = np.arange(0, batch_size) + # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then + # `batch_idx` is `np.array([0, 0, 0, 1, 1, 1])` + batch_idx = np.repeat(batch_idx, num_pred_positions) + masked_X = X[batch_idx, pred_positions] + masked_X = masked_X.reshape((batch_size, num_pred_positions, -1)) + mlm_Y_hat = self.mlp(masked_X) + return mlm_Y_hat +``` + +```{.python .input} +#@tab pytorch +#@save +class MaskLM(nn.Module): + """The masked language model task of BERT.""" + def __init__(self, vocab_size, num_hiddens, num_inputs=768, **kwargs): + super(MaskLM, self).__init__(**kwargs) + self.mlp = nn.Sequential(nn.Linear(num_inputs, num_hiddens), + nn.ReLU(), + nn.LayerNorm(num_hiddens), + nn.Linear(num_hiddens, vocab_size)) + + def forward(self, X, pred_positions): + num_pred_positions = pred_positions.shape[1] + pred_positions = pred_positions.reshape(-1) + batch_size = X.shape[0] + batch_idx = torch.arange(0, batch_size) + # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then + # `batch_idx` is `torch.tensor([0, 0, 0, 1, 1, 1])` + batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions) + masked_X = X[batch_idx, pred_positions] + masked_X = masked_X.reshape((batch_size, num_pred_positions, -1)) + mlm_Y_hat = self.mlp(masked_X) + return mlm_Y_hat +``` + +To demonstrate the forward inference of `MaskLM`, +we create its instance `mlm` and initialize it. +Recall that `encoded_X` from the forward inference of `BERTEncoder` +represents 2 BERT input sequences. +We define `mlm_positions` as the 3 indices to predict in either BERT input sequence of `encoded_X`. +The forward inference of `mlm` returns prediction results `mlm_Y_hat` +at all the masked positions `mlm_positions` of `encoded_X`. +For each prediction, the size of the result is equal to the vocabulary size. + +```{.python .input} +mlm = MaskLM(vocab_size, num_hiddens) +mlm.initialize() +mlm_positions = np.array([[1, 5, 2], [6, 1, 5]]) +mlm_Y_hat = mlm(encoded_X, mlm_positions) +mlm_Y_hat.shape +``` + +```{.python .input} +#@tab pytorch +mlm = MaskLM(vocab_size, num_hiddens) +mlm_positions = torch.tensor([[1, 5, 2], [6, 1, 5]]) +mlm_Y_hat = mlm(encoded_X, mlm_positions) +mlm_Y_hat.shape +``` + +With the ground truth labels `mlm_Y` of the predicted tokens `mlm_Y_hat` under masks, +we can calculate the cross-entropy loss of the masked language model task in BERT pretraining. 
+ +```{.python .input} +mlm_Y = np.array([[7, 8, 9], [10, 20, 30]]) +loss = gluon.loss.SoftmaxCrossEntropyLoss() +mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1)) +mlm_l.shape +``` + +```{.python .input} +#@tab pytorch +mlm_Y = torch.tensor([[7, 8, 9], [10, 20, 30]]) +loss = nn.CrossEntropyLoss(reduction='none') +mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1)) +mlm_l.shape +``` + +### Next Sentence Prediction +:label:`subsec_nsp` + +Although masked language modeling is able to encode bidirectional context +for representing words, it does not explicitly model the logical relationship +between text pairs. +To help understand the relationship between two text sequences, +BERT considers a binary classification task, *next sentence prediction*, in its pretraining. +When generating sentence pairs for pretraining, +for half of the time they are indeed consecutive sentences with the label "True"; +while for the other half of the time the second sentence is randomly sampled from the corpus with the label "False". + +The following `NextSentencePred` class uses a one-hidden-layer MLP +to predict whether the second sentence is the next sentence of the first +in the BERT input sequence. +Due to self-attention in the transformer encoder, +the BERT representation of the special token “<cls>” +encodes both the two sentences from the input. +Hence, the output layer (`self.output`) of the MLP classifier takes `X` as the input, +where `X` is the output of the MLP hidden layer whose input is the encoded “<cls>” token. + +```{.python .input} +#@save +class NextSentencePred(nn.Block): + """The next sentence prediction task of BERT.""" + def __init__(self, **kwargs): + super(NextSentencePred, self).__init__(**kwargs) + self.output = nn.Dense(2) + + def forward(self, X): + # `X` shape: (batch size, `num_hiddens`) + return self.output(X) +``` + +```{.python .input} +#@tab pytorch +#@save +class NextSentencePred(nn.Module): + """The next sentence prediction task of BERT.""" + def __init__(self, num_inputs, **kwargs): + super(NextSentencePred, self).__init__(**kwargs) + self.output = nn.Linear(num_inputs, 2) + + def forward(self, X): + # `X` shape: (batch size, `num_hiddens`) + return self.output(X) +``` + +We can see that the forward inference of an `NextSentencePred` instance +returns binary predictions for each BERT input sequence. + +```{.python .input} +nsp = NextSentencePred() +nsp.initialize() +nsp_Y_hat = nsp(encoded_X) +nsp_Y_hat.shape +``` + +```{.python .input} +#@tab pytorch +# PyTorch by default won't flatten the tensor as seen in mxnet where, if +# flatten=True, all but the first axis of input data are collapsed together +encoded_X = torch.flatten(encoded_X, start_dim=1) +# input_shape for NSP: (batch size, `num_hiddens`) +nsp = NextSentencePred(encoded_X.shape[-1]) +nsp_Y_hat = nsp(encoded_X) +nsp_Y_hat.shape +``` + +The cross-entropy loss of the 2 binary classifications can also be computed. + +```{.python .input} +nsp_y = np.array([0, 1]) +nsp_l = loss(nsp_Y_hat, nsp_y) +nsp_l.shape +``` + +```{.python .input} +#@tab pytorch +nsp_y = torch.tensor([0, 1]) +nsp_l = loss(nsp_Y_hat, nsp_y) +nsp_l.shape +``` + +It is noteworthy that all the labels in both the aforementioned pretraining tasks +can be trivially obtained from the pretraining corpus without manual labeling effort. +The original BERT has been pretrained on the concatenation of BookCorpus :cite:`Zhu.Kiros.Zemel.ea.2015` +and English Wikipedia. 
+These two text corpora are huge: +they have 800 million words and 2.5 billion words, respectively. + + +## Putting All Things Together + +When pretraining BERT, the final loss function is a linear combination of +both the loss functions for masked language modeling and next sentence prediction. +Now we can define the `BERTModel` class by instantiating the three classes +`BERTEncoder`, `MaskLM`, and `NextSentencePred`. +The forward inference returns the encoded BERT representations `encoded_X`, +predictions of masked language modeling `mlm_Y_hat`, +and next sentence predictions `nsp_Y_hat`. + +```{.python .input} +#@save +class BERTModel(nn.Block): + """The BERT model.""" + def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads, + num_layers, dropout, max_len=1000): + super(BERTModel, self).__init__() + self.encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, + num_heads, num_layers, dropout, max_len) + self.hidden = nn.Dense(num_hiddens, activation='tanh') + self.mlm = MaskLM(vocab_size, num_hiddens) + self.nsp = NextSentencePred() + + def forward(self, tokens, segments, valid_lens=None, pred_positions=None): + encoded_X = self.encoder(tokens, segments, valid_lens) + if pred_positions is not None: + mlm_Y_hat = self.mlm(encoded_X, pred_positions) + else: + mlm_Y_hat = None + # The hidden layer of the MLP classifier for next sentence prediction. + # 0 is the index of the '' token + nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :])) + return encoded_X, mlm_Y_hat, nsp_Y_hat +``` + +```{.python .input} +#@tab pytorch +#@save +class BERTModel(nn.Module): + """The BERT model.""" + def __init__(self, vocab_size, num_hiddens, norm_shape, ffn_num_input, + ffn_num_hiddens, num_heads, num_layers, dropout, + max_len=1000, key_size=768, query_size=768, value_size=768, + hid_in_features=768, mlm_in_features=768, + nsp_in_features=768): + super(BERTModel, self).__init__() + self.encoder = BERTEncoder(vocab_size, num_hiddens, norm_shape, + ffn_num_input, ffn_num_hiddens, num_heads, num_layers, + dropout, max_len=max_len, key_size=key_size, + query_size=query_size, value_size=value_size) + self.hidden = nn.Sequential(nn.Linear(hid_in_features, num_hiddens), + nn.Tanh()) + self.mlm = MaskLM(vocab_size, num_hiddens, mlm_in_features) + self.nsp = NextSentencePred(nsp_in_features) + + def forward(self, tokens, segments, valid_lens=None, pred_positions=None): + encoded_X = self.encoder(tokens, segments, valid_lens) + if pred_positions is not None: + mlm_Y_hat = self.mlm(encoded_X, pred_positions) + else: + mlm_Y_hat = None + # The hidden layer of the MLP classifier for next sentence prediction. + # 0 is the index of the '' token + nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :])) + return encoded_X, mlm_Y_hat, nsp_Y_hat +``` + +## Summary + +* Word embedding models such as word2vec and GloVe are context-independent. They assign the same pretrained vector to the same word regardless of the context of the word (if any). It is hard for them to handle well polysemy or complex semantics in natural languages. +* For context-sensitive word representations such as ELMo and GPT, representations of words depend on their contexts. +* ELMo encodes context bidirectionally but uses task-specific architectures (however, it is practically non-trivial to craft a specific architecture for every natural language processing task); while GPT is task-agnostic but encodes context left-to-right. 
+* BERT combines the best of both worlds: it encodes context bidirectionally and requires minimal architecture changes for a wide range of natural language processing tasks. +* The embeddings of the BERT input sequence are the sum of the token embeddings, segment embeddings, and positional embeddings. +* Pretraining BERT is composed of two tasks: masked language modeling and next sentence prediction. The former is able to encode bidirectional context for representing words, while the latter explicitly models the logical relationship between text pairs. + + +## Exercises + +1. Why does BERT succeed? +1. All other things being equal, will a masked language model require more or fewer pretraining steps to converge than a left-to-right language model? Why? +1. In the original implementation of BERT, the positionwise feed-forward network in `BERTEncoder` (via `d2l.EncoderBlock`) and the fully-connected layer in `MaskLM` both use the Gaussian error linear unit (GELU) :cite:`Hendrycks.Gimpel.2016` as the activation function. Research into the difference between GELU and ReLU. + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/388) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1490) +:end_tab: diff --git a/chapter_natural-language-processing-pretraining/glove.md b/chapter_natural-language-processing-pretraining/glove.md new file mode 100644 index 000000000..2bf683fa2 --- /dev/null +++ b/chapter_natural-language-processing-pretraining/glove.md @@ -0,0 +1,97 @@ +# 带全局向量的单词嵌入 (GLOVE) +:label:`sec_glove` + +上下文窗口中的单词同时出现可能会带有丰富的语义信息。例如,在大型语料库中,“固体” 一词比 “蒸汽” 更有可能与 “冰” 共存,但是 “气” 一词可能与 “蒸汽” 共同出现的频率比 “冰” 更频繁。此外,可以预先计算此类同时出现的全球语料库统计数据:这可以提高培训效率。为了利用整个语料库中的统计信息进行单词嵌入,让我们首先重温 :numref:`subsec_skip-gram` 中的跳过图模型,但是使用全局语料库统计数(例如共生计数)来解释它。 + +## 跳过 Gram 与全球语料库统计 +:label:`subsec_skipgram-global` + +以 $q_{ij}$ 表示 $w_j$ 字的条件概率 $P(w_j\mid w_i)$ 在跳过图模型中给出的单词 $w_i$,我们有 + +$$q_{ij}=\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{ \sum_{k \in \mathcal{V}} \text{exp}(\mathbf{u}_k^\top \mathbf{v}_i)},$$ + +其中,任何索引 $i$ 向量 $\mathbf{v}_i$ 和 $\mathbf{u}_i$ 分别表示单词 $w_i$ 作为中心词和上下文词,$\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}$ 是词汇的索引集。 + +考虑一下可能在语料库中多次出现的单词 $w_i$。在整个语料库中,所有上下文单词无论 $w_i$ 被视为中心词,都构成了 * 多集 * $\mathcal{C}_i$ 的单词索引,* 允许同一元素的多个实例 *。对于任何元素,它的实例数都称为其 * 多重性 *。举个例子来说明,假设单词 $w_i$ 在语料库中出现两次,在两个上下文窗口中以 $w_i$ 作为中心词的上下文词的索引是 $k, j, m, k$ 和 $k, l, k, j$。因此,多集 $\mathcal{C}_i = \{j, j, k, k, k, k, l, m\}$,其中元素 $j, k, l, m$ 的多重性分别为 2、4、1、1。 + +现在让我们将多集 $\mathcal{C}_i$ 中元素 $j$ 的多重性表示为 $x_{ij}$。这是整个语料库中同一上下文窗口中单词 $w_j$(作为上下文单词)和单词 $w_i$(作为中心词)的全局共生计数。使用这样的全局语料库统计数据,跳过图模型的损失函数等同于 + +$$-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} x_{ij} \log\,q_{ij}.$$ +:eqlabel:`eq_skipgram-x_ij` + +我们进一步用 $x_i$ 表示上下文窗口中所有上下文单词的数量,其中 $w_i$ 作为中心词出现,相当于 $|\mathcal{C}_i|$。让 $p_{ij}$ 成为生成上下文单词 $w_j$ 的条件概率 $x_{ij}/x_i$,给定中心字 $w_i$,:eqref:`eq_skipgram-x_ij` 可以重写为 + +$$-\sum_{i\in\mathcal{V}} x_i \sum_{j\in\mathcal{V}} p_{ij} \log\,q_{ij}.$$ +:eqlabel:`eq_skipgram-p_ij` + +在 :eqref:`eq_skipgram-p_ij` 中,$-\sum_{j\in\mathcal{V}} p_{ij} \log\,q_{ij}$ 计算了全局语料库统计数据的条件分布 $p_{ij}$ 的交叉熵和模型预测的条件分布 $q_{ij}$。如上所述,这一损失也加权了 $x_i$。最大限度地减少 :eqref:`eq_skipgram-p_ij` 中的损失函数将允许预测的条件分布接近全球语料库统计数据中的条件分布。 + +尽管通常用于测量概率分布之间的距离,但交叉熵损失函数可能不是一个很好的选择。一方面,正如我们在 :numref:`sec_approx_train` 中提到的那样,正确标准化 $q_{ij}$ 的成本导致了整个词汇的总和,这可能是计算昂贵的。另一方面,来自大量语料库的大量罕见事件通常以交叉熵损失为模型,而不能分配过多的权重。 + +## Glove 模型 + +有鉴于此,*Glove* 模型基于平方损失 :cite:`Pennington.Socher.Manning.2014` 对跳跃图模型进行了三项更改: + +1. 
使用变量 $p'_{ij}=x_{ij}$ 和 $q'_{ij}=\exp(\mathbf{u}_j^\top \mathbf{v}_i)$ +这不是概率分布,而是两者的对数,因此平方损失期限为 $\left(\log\,p'_{ij} - \log\,q'_{ij}\right)^2 = \left(\mathbf{u}_j^\top \mathbf{v}_i - \log\,x_{ij}\right)^2$。 +2. 为每个单词 $w_i$ 添加两个标量模型参数:中心词偏差 $b_i$ 和上下文词偏差 $c_i$。 +3. 用权重函数 $h(x_{ij})$ 替换每个损失期的权重,其中 $h(x)$ 在 $[0, 1]$ 的间隔内增加了 $h(x)$。 + +将所有事情放在一起,训练 GLOVE 是为了尽量减少以下损失功能: + +$$\sum_{i\in\mathcal{V}} \sum_{j\in\mathcal{V}} h(x_{ij}) \left(\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j - \log\,x_{ij}\right)^2.$$ +:eqlabel:`eq_glove-loss` + +对于权重函数,建议的选择是:$h(x) = (x/c) ^\alpha$(例如 $\alpha = 0.75$)如果是 $x < c$(例如 $c = 100$),否则为 $h(x) = 1$。在这种情况下,由于 $h(0)=0$,为了计算效率,可以省略任何 $x_{ij}=0$ 的平方损失期限。例如,当使用迷你批随机梯度下降进行训练时,在每次迭代中,我们都随机采样一个 * 非零 * $x_{ij}$ 的迷你匹配,以计算渐变并更新模型参数。请注意,这些非零 $x_{ij}$ 是预先计算的全局语料库统计数据;因此,该模型被称为 *Global Vector* 的 Glove。 + +应该强调的是,如果单词 $w_i$ 出现在单词 $w_j$ 的上下文窗口中,那么 * 反之 *。因此,$x_{ij}=x_{ji}$。与适合不对称条件概率 $p_{ij}$ 的 word2vec 不同,Glove 适合对称 $\log \, x_{ij}$。因此,在 GLOVE 模型中,任何单词的中心单词矢量和上下文单词矢量在数学上是等同的。但是在实践中,由于初始值不同,训练后同一个词可能仍然会在这两个向量中得到不同的值:GloVE 将它们总结为输出矢量。 + +## 从共发概率比例解释 Glove + +我们也可以从另一个角度解释 GLOVE 模型。在 :numref:`subsec_skipgram-global` 中使用相同的符号,让 $p_{ij} \stackrel{\mathrm{def}}{=} P(w_j \mid w_i)$ 成为生成上下文单词 $w_j$ 的条件概率,给定 $w_i$ 作为语料库中的中心词。:numref:`tab_glove` 列出了 “冰” 和 “蒸汽” 两个词的几个共同出现概率及其基于大型语料库统计数据的比率。 + +:Word-word co-occurrence probabilities and their ratios from a large corpus (adapted from Table 1 in :cite:`Pennington.Socher.Manning.2014`:) + +|$w_k$=|solid|gas|water|fashion| +|:--|:-|:-|:-|:-| +|$p_1=P(w_k\mid \text{ice})$|0.00019|0.000066|0.003|0.000017| +|$p_2=P(w_k\mid\text{steam})$|0.000022|0.00078|0.0022|0.000018| +|$p_1/p_2$|8.9|0.085|1.36|0.96| +:label:`tab_glove` + +我们可以从 :numref:`tab_glove` 观察到以下内容: + +* 对于与 “冰” 有关但与 “蒸汽” 无关的单词 $w_k$,例如 $w_k=\text{solid}$,我们预计共发概率比例更大,例如 8.9。 +* 对于与 “蒸汽” 有关但与 “冰” 无关的单词 $w_k$,例如 $w_k=\text{gas}$,我们预计共发概率比例较小,例如 0.085。 +* 对于与 “冰” 和 “蒸汽” 都有关的单词 $w_k$,例如 $w_k=\text{water}$,我们预计共同发生概率的比率接近 1,例如 1.36。 +* 对于与 “冰” 和 “蒸汽” 无关的单词 $w_k$,例如 $w_k=\text{fashion}$,我们预计共同发生概率的比率接近 1,例如 0.96。 + +可以看出,共同发生概率的比率可以直观地表达单词之间的关系。因此,我们可以设计一个由三个词向量组成的函数来适应这个比例。对于共发概率的比例 ${p_{ij}}/{p_{ik}}$,其中 $w_i$ 是中心词,$w_j$ 和 $w_k$ 是上下文词,我们希望使用一些函数 $f$ 来调整这个比率: + +$$f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) \approx \frac{p_{ij}}{p_{ik}}.$$ +:eqlabel:`eq_glove-f` + +在 $f$ 的许多可能设计中,我们只选择以下合理的选择。由于共生概率比率是标量,因此我们要求 $f$ 是标量函数,例如 $f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) = f\left((\mathbf{u}_j - \mathbf{u}_k)^\top {\mathbf{v}}_i\right)$。在 :eqref:`eq_glove-f` 中切换字指数 $j$ 和 $k$,它必须持有 $f(x)f(-x)=1$,所以一种可能性是 $f(x)=\exp(x)$,即 + +$$f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) = \frac{\exp\left(\mathbf{u}_j^\top {\mathbf{v}}_i\right)}{\exp\left(\mathbf{u}_k^\top {\mathbf{v}}_i\right)} \approx \frac{p_{ij}}{p_{ik}}.$$ + +现在让我们选择 $\exp\left(\mathbf{u}_j^\top {\mathbf{v}}_i\right) \approx \alpha p_{ij}$,其中 $\alpha$ 是常数。自 $p_{ij}=x_{ij}/x_i$ 起,在双方对数后,我们得到了 $\mathbf{u}_j^\top {\mathbf{v}}_i \approx \log\,\alpha + \log\,x_{ij} - \log\,x_i$。我们可能会使用其他偏见术语来适应 $- \log\, \alpha + \log\, x_i$,例如中心词偏差 $b_i$ 和上下文词偏差 $c_j$: + +$$\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j \approx \log\, x_{ij}.$$ +:eqlabel:`eq_glove-square` + +用重量测量 :eqref:`eq_glove-square` 的平方误差,获得了 :eqref:`eq_glove-loss` 中的 Glove 损失函数。 + +## 摘要 + +* 跳过图模型可以使用全局语料库统计数据(例如单词共生计数)来解释。 +* 交叉熵损失可能不是衡量两种概率分布差异的好选择,特别是对于大型语料库而言。GLOVE 使用平方损耗来适应预先计算的全局语料库统计数据。 +* 中心单词矢量和上下文单词矢量在数学上对于 GloVE 中的任何单词来说都是等同的。 +* GLOVE 可以从单词-词共生概率的比率来解释。 + +## 练习 + +1. 
如果单词 $w_i$ 和 $w_j$ 同时出现在同一个上下文窗口中,我们怎样才能使用它们在文本序列中的距离来重新设计计算条件概率 $p_{ij}$ 的方法?Hint: see Section 4.2 of the GloVe paper :cite:`Pennington.Socher.Manning.2014`。 +1. 对于任何单词来说,它的中心词偏见和上下文词偏见在 Glove 中数学上是否等同?为什么? + +[Discussions](https://discuss.d2l.ai/t/385) diff --git a/chapter_natural-language-processing-pretraining/glove_origin.md b/chapter_natural-language-processing-pretraining/glove_origin.md new file mode 100644 index 000000000..8902278ae --- /dev/null +++ b/chapter_natural-language-processing-pretraining/glove_origin.md @@ -0,0 +1,285 @@ +# Word Embedding with Global Vectors (GloVe) +:label:`sec_glove` + + +Word-word co-occurrences +within context windows +may carry rich semantic information. +For example, +in a large corpus +word "solid" is +more likely to co-occur +with "ice" than "steam", +but word "gas" +probably co-occurs with "steam" +more frequently than "ice". +Besides, +global corpus statistics +of such co-occurrences +can be precomputed: +this can lead to more efficient training. +To leverage statistical +information in the entire corpus +for word embedding, +let us first revisit +the skip-gram model in :numref:`subsec_skip-gram`, +but interpreting it +using global corpus statistics +such as co-occurrence counts. + +## Skip-Gram with Global Corpus Statistics +:label:`subsec_skipgram-global` + +Denoting by $q_{ij}$ +the conditional probability +$P(w_j\mid w_i)$ +of word $w_j$ given word $w_i$ +in the skip-gram model, +we have + +$$q_{ij}=\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{ \sum_{k \in \mathcal{V}} \text{exp}(\mathbf{u}_k^\top \mathbf{v}_i)},$$ + +where +for any index $i$ +vectors $\mathbf{v}_i$ and $\mathbf{u}_i$ +represent word $w_i$ +as the center word and context word, +respectively, and $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}$ +is the index set of the vocabulary. + +Consider word $w_i$ +that may occur multiple times +in the corpus. +In the entire corpus, +all the context words +wherever $w_i$ is taken as their center word +form a *multiset* $\mathcal{C}_i$ +of word indices +that *allows for multiple instances of the same element*. +For any element, +its number of instances is called its *multiplicity*. +To illustrate with an example, +suppose that word $w_i$ occurs twice in the corpus +and indices of the context words +that take $w_i$ as their center word +in the two context windows +are +$k, j, m, k$ and $k, l, k, j$. +Thus, multiset $\mathcal{C}_i = \{j, j, k, k, k, k, l, m\}$, where +multiplicities of elements $j, k, l, m$ +are 2, 4, 1, 1, respectively. + +Now let us denote the multiplicity of element $j$ in +multiset $\mathcal{C}_i$ as $x_{ij}$. +This is the global co-occurrence count +of word $w_j$ (as the context word) +and word $w_i$ (as the center word) +in the same context window +in the entire corpus. +Using such global corpus statistics, +the loss function of the skip-gram model +is equivalent to + +$$-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} x_{ij} \log\,q_{ij}.$$ +:eqlabel:`eq_skipgram-x_ij` + +We further denote by +$x_i$ +the number of all the context words +in the context windows +where $w_i$ occurs as their center word, +which is equivalent to $|\mathcal{C}_i|$. 
+Letting $p_{ij}$ +be the conditional probability +$x_{ij}/x_i$ for generating +context word $w_j$ given center word $w_i$, +:eqref:`eq_skipgram-x_ij` +can be rewritten as + +$$-\sum_{i\in\mathcal{V}} x_i \sum_{j\in\mathcal{V}} p_{ij} \log\,q_{ij}.$$ +:eqlabel:`eq_skipgram-p_ij` + +In :eqref:`eq_skipgram-p_ij`, $-\sum_{j\in\mathcal{V}} p_{ij} \log\,q_{ij}$ calculates +the cross-entropy +of +the conditional distribution $p_{ij}$ +of global corpus statistics +and +the +conditional distribution $q_{ij}$ +of model predictions. +This loss +is also weighted by $x_i$ as explained above. +Minimizing the loss function in +:eqref:`eq_skipgram-p_ij` +will allow +the predicted conditional distribution +to get close to +the conditional distribution +from the global corpus statistics. + + +Though being commonly used +for measuring the distance +between probability distributions, +the cross-entropy loss function may not be a good choice here. +On one hand, as we mentioned in :numref:`sec_approx_train`, +the cost of properly normalizing $q_{ij}$ +results in the sum over the entire vocabulary, +which can be computationally expensive. +On the other hand, +a large number of rare +events from a large corpus +are often modeled by the cross-entropy loss +to be assigned with +too much weight. + +## The GloVe Model + +In view of this, +the *GloVe* model makes three changes +to the skip-gram model based on squared loss :cite:`Pennington.Socher.Manning.2014`: + +1. Use variables $p'_{ij}=x_{ij}$ and $q'_{ij}=\exp(\mathbf{u}_j^\top \mathbf{v}_i)$ +that are not probability distributions +and take the logarithm of both, so the squared loss term is $\left(\log\,p'_{ij} - \log\,q'_{ij}\right)^2 = \left(\mathbf{u}_j^\top \mathbf{v}_i - \log\,x_{ij}\right)^2$. +2. Add two scalar model parameters for each word $w_i$: the center word bias $b_i$ and the context word bias $c_i$. +3. Replace the weight of each loss term with the weight function $h(x_{ij})$, where $h(x)$ is increasing in the interval of $[0, 1]$. + +Putting all things together, training GloVe is to minimize the following loss function: + +$$\sum_{i\in\mathcal{V}} \sum_{j\in\mathcal{V}} h(x_{ij}) \left(\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j - \log\,x_{ij}\right)^2.$$ +:eqlabel:`eq_glove-loss` + +For the weight function, a suggested choice is: +$h(x) = (x/c) ^\alpha$ (e.g $\alpha = 0.75$) if $x < c$ (e.g., $c = 100$); otherwise $h(x) = 1$. +In this case, +because $h(0)=0$, +the squared loss term for any $x_{ij}=0$ can be omitted +for computational efficiency. +For example, +when using minibatch stochastic gradient descent for training, +at each iteration +we randomly sample a minibatch of *non-zero* $x_{ij}$ +to calculate gradients +and update the model parameters. +Note that these non-zero $x_{ij}$ are precomputed +global corpus statistics; +thus, the model is called GloVe +for *Global Vectors*. + +It should be emphasized that +if word $w_i$ appears in the context window of +word $w_j$, then *vice versa*. +Therefore, $x_{ij}=x_{ji}$. +Unlike word2vec +that fits the asymmetric conditional probability +$p_{ij}$, +GloVe fits the symmetric $\log \, x_{ij}$. +Therefore, the center word vector and +the context word vector of any word are mathematically equivalent in the GloVe model. +However in practice, owing to different initialization values, +the same word may still get different values +in these two vectors after training: +GloVe sums them up as the output vector. 
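+
+To make the loss in :eqref:`eq_glove-loss` concrete, here is a minimal,
+framework-free sketch that evaluates it on a handful of nonzero co-occurrence
+counts. The toy counts, the randomly initialized vectors, and the zero biases
+are placeholders; only the weight function $h(x)$ with the suggested $c = 100$
+and $\alpha = 0.75$ and the squared-error term follow the definitions above.
+
+```{.python .input}
+import math
+import random
+
+def weight_fn(x, c=100, alpha=0.75):
+    """The weight h(x): (x/c)**alpha if x < c, and 1 otherwise."""
+    return (x / c) ** alpha if x < c else 1
+
+def glove_loss(counts, v, u, b, c_bias):
+    """Sum of h(x_ij) * (u_j^T v_i + b_i + c_j - log x_ij)^2 over nonzero x_ij."""
+    total = 0
+    for (i, j), x_ij in counts.items():
+        score = sum(uj_d * vi_d for uj_d, vi_d in zip(u[j], v[i]))
+        error = score + b[i] + c_bias[j] - math.log(x_ij)
+        total += weight_fn(x_ij) * error ** 2
+    return total
+
+# Toy setup: 4 words, 5-dimensional vectors, a few nonzero counts
+random.seed(0)
+vocab_size, dim = 4, 5
+v = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
+u = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
+b, c_bias = [0.0] * vocab_size, [0.0] * vocab_size  # biases b_i and c_j
+counts = {(0, 1): 12, (1, 0): 12, (0, 2): 3, (2, 0): 3, (1, 3): 150, (3, 1): 150}
+glove_loss(counts, v, u, b, c_bias)
+```
+
+In practice this objective is minimized with minibatch stochastic gradient
+descent over randomly sampled nonzero $x_{ij}$, as noted above.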
+ + + +## Interpreting GloVe from the Ratio of Co-occurrence Probabilities + + +We can also interpret the GloVe model from another perspective. +Using the same notation in +:numref:`subsec_skipgram-global`, +let $p_{ij} \stackrel{\mathrm{def}}{=} P(w_j \mid w_i)$ be the conditional probability of generating the context word $w_j$ given $w_i$ as the center word in the corpus. +:numref:`tab_glove` +lists several co-occurrence probabilities +given words "ice" and "steam" +and their ratios based on statistics from a large corpus. + + +:Word-word co-occurrence probabilities and their ratios from a large corpus (adapted from Table 1 in :cite:`Pennington.Socher.Manning.2014`:) + + +|$w_k$=|solid|gas|water|fashion| +|:--|:-|:-|:-|:-| +|$p_1=P(w_k\mid \text{ice})$|0.00019|0.000066|0.003|0.000017| +|$p_2=P(w_k\mid\text{steam})$|0.000022|0.00078|0.0022|0.000018| +|$p_1/p_2$|8.9|0.085|1.36|0.96| +:label:`tab_glove` + + +We can observe the following from :numref:`tab_glove`: + +* For a word $w_k$ that is related to "ice" but unrelated to "steam", such as $w_k=\text{solid}$, we expect a larger ratio of co-occurence probabilities, such as 8.9. +* For a word $w_k$ that is related to "steam" but unrelated to "ice", such as $w_k=\text{gas}$, we expect a smaller ratio of co-occurence probabilities, such as 0.085. +* For a word $w_k$ that is related to both "ice" and "steam", such as $w_k=\text{water}$, we expect a ratio of co-occurence probabilities that is close to 1, such as 1.36. +* For a word $w_k$ that is unrelated to both "ice" and "steam", such as $w_k=\text{fashion}$, we expect a ratio of co-occurence probabilities that is close to 1, such as 0.96. + + + + +It can be seen that the ratio +of co-occurrence probabilities +can intuitively express +the relationship between words. +Thus, we can design a function +of three word vectors +to fit this ratio. +For the ratio of co-occurrence probabilities +${p_{ij}}/{p_{ik}}$ +with $w_i$ being the center word +and $w_j$ and $w_k$ being the context words, +we want to fit this ratio +using some function $f$: + +$$f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) \approx \frac{p_{ij}}{p_{ik}}.$$ +:eqlabel:`eq_glove-f` + +Among many possible designs for $f$, +we only pick a reasonable choice in the following. +Since the ratio of co-occurrence probabilities +is a scalar, +we require that +$f$ be a scalar function, such as +$f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) = f\left((\mathbf{u}_j - \mathbf{u}_k)^\top {\mathbf{v}}_i\right)$. +Switching word indices +$j$ and $k$ in :eqref:`eq_glove-f`, +it must hold that +$f(x)f(-x)=1$, +so one possibility is $f(x)=\exp(x)$, +i.e., + +$$f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) = \frac{\exp\left(\mathbf{u}_j^\top {\mathbf{v}}_i\right)}{\exp\left(\mathbf{u}_k^\top {\mathbf{v}}_i\right)} \approx \frac{p_{ij}}{p_{ik}}.$$ + +Now let us pick +$\exp\left(\mathbf{u}_j^\top {\mathbf{v}}_i\right) \approx \alpha p_{ij}$, +where $\alpha$ is a constant. +Since $p_{ij}=x_{ij}/x_i$, after taking the logarithm on both sides we get $\mathbf{u}_j^\top {\mathbf{v}}_i \approx \log\,\alpha + \log\,x_{ij} - \log\,x_i$. +We may use additional bias terms to fit $- \log\, \alpha + \log\, x_i$, such as the center word bias $b_i$ and the context word bias $c_j$: + +$$\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j \approx \log\, x_{ij}.$$ +:eqlabel:`eq_glove-square` + +Measuring the squared error of +:eqref:`eq_glove-square` with weights, +the GloVe loss function in +:eqref:`eq_glove-loss` is obtained. 
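+
+As a complement, the sketch below shows how the global statistics used
+throughout this section could be gathered: it counts co-occurrences $x_{ij}$
+within a fixed-size context window on a tiny made-up corpus and prints ratios
+analogous to $p_{ij}/p_{ik}$ in :numref:`tab_glove`. The corpus, the window
+size of 2, and the probe words are invented for illustration; on such a small
+corpus some counts are zero, so the ratios are far noisier than those obtained
+from a large corpus.
+
+```{.python .input}
+import collections
+
+# A tiny made-up corpus; each entry is one tokenized sentence
+sentences = ['solid ice is cold but hot steam is gas'.split(),
+             'ice water and steam water are both water'.split(),
+             'hot steam rises while cold solid ice melts'.split()]
+
+window = 2  # number of context words on each side of the center word
+x = collections.Counter()  # global co-occurrence counts x_ij
+for sentence in sentences:
+    for i, center in enumerate(sentence):
+        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
+            if j != i:
+                x[center, sentence[j]] += 1
+
+# x_i = |C_i|: the total number of context words taking word i as the center
+x_i = collections.Counter()
+for (center, context), count in x.items():
+    x_i[center] += count
+
+def cond_prob(context, center):
+    """p_ij = x_ij / x_i."""
+    return x[center, context] / x_i[center]
+
+for w_k in ['solid', 'hot', 'water']:
+    p1, p2 = cond_prob(w_k, 'ice'), cond_prob(w_k, 'steam')
+    ratio = p1 / p2 if p2 > 0 else float('inf')
+    print(f'w_k={w_k}: p(w_k|ice)={p1:.3f}, p(w_k|steam)={p2:.3f}, '
+          f'ratio={ratio:.2f}')
+```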
+ + + +## Summary + +* The skip-gram model can be interpreted using global corpus statistics such as word-word co-occurrence counts. +* The cross-entropy loss may not be a good choice for measuring the difference of two probability distributions, especially for a large corpus. GloVe uses squared loss to fit precomputed global corpus statistics. +* The center word vector and the context word vector are mathematically equivalent for any word in GloVe. +* GloVe can be interpreted from the ratio of word-word co-occurrence probabilities. + + +## Exercises + +1. If words $w_i$ and $w_j$ co-occur in the same context window, how can we use their distance in the text sequence to redesign the method for calculating the conditional probability $p_{ij}$? Hint: see Section 4.2 of the GloVe paper :cite:`Pennington.Socher.Manning.2014`. +1. For any word, are its center word bias and context word bias mathematically equivalent in GloVe? Why? + + +[Discussions](https://discuss.d2l.ai/t/385) diff --git a/chapter_natural-language-processing-pretraining/similarity-analogy.md b/chapter_natural-language-processing-pretraining/similarity-analogy.md new file mode 100644 index 000000000..9b2226760 --- /dev/null +++ b/chapter_natural-language-processing-pretraining/similarity-analogy.md @@ -0,0 +1,220 @@ +# 词相似性和类比 +:label:`sec_synonyms` + +在 :numref:`sec_word2vec_pretraining` 中,我们在一个小数据集上训练了一个 word2vec 模型,并应用它来查找输入词语义上相似的词。实际上,在大型语言上预训练的单词向量可以应用于下游自然语言处理任务,后面将在 :numref:`chap_nlp_app` 中介绍这些任务。为了以直接的方式展示来自大型语言的预训练单词矢量的语义,让我们在单词相似性和类比任务中应用它们。 + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import np, npx +import os + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +import os +``` + +## 加载预训练的词向量 + +下面列出了维度 50、100 和 300 的预训练 GLOVE 嵌入,可以从 [GloVe website](https://nlp.stanford.edu/projects/glove/) 下载。经过预训练的 FastText 嵌入提供多种语言版本。这里我们考虑一个英文版本(300 维 “wiki Ien”),它可以从 [fastText website](https://fasttext.cc/) 下载。 + +```{.python .input} +#@tab all +#@save +d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip', + '0b8703943ccdb6eb788e6f091b8946e82231bc4d') + +#@save +d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip', + 'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a') + +#@save +d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip', + 'b5116e234e9eb9076672cfeabf5469f3eec904fa') + +#@save +d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip', + 'c1816da3821ae9f43899be655002f6c723e91b88') +``` + +为了加载这些预训练的 Glove 和 FastText 嵌入,我们定义了以下 `TokenEmbedding` 类。 + +```{.python .input} +#@tab all +#@save +class TokenEmbedding: + """Token Embedding.""" + def __init__(self, embedding_name): + self.idx_to_token, self.idx_to_vec = self._load_embedding( + embedding_name) + self.unknown_idx = 0 + self.token_to_idx = {token: idx for idx, token in + enumerate(self.idx_to_token)} + + def _load_embedding(self, embedding_name): + idx_to_token, idx_to_vec = [''], [] + data_dir = d2l.download_extract(embedding_name) + # GloVe website: https://nlp.stanford.edu/projects/glove/ + # fastText website: https://fasttext.cc/ + with open(os.path.join(data_dir, 'vec.txt'), 'r') as f: + for line in f: + elems = line.rstrip().split(' ') + token, elems = elems[0], [float(elem) for elem in elems[1:]] + # Skip header information, such as the top row in fastText + if len(elems) > 1: + idx_to_token.append(token) + idx_to_vec.append(elems) + idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec + return idx_to_token, d2l.tensor(idx_to_vec) 
+ + def __getitem__(self, tokens): + indices = [self.token_to_idx.get(token, self.unknown_idx) + for token in tokens] + vecs = self.idx_to_vec[d2l.tensor(indices)] + return vecs + + def __len__(self): + return len(self.idx_to_token) +``` + +下面我们加载 50 维 GloVE 嵌入物(在维基百科子集上预训练)。创建 `TokenEmbedding` 实例时,如果尚未下载指定的嵌入文件,则必须下载该文件。 + +```{.python .input} +#@tab all +glove_6b50d = TokenEmbedding('glove.6b.50d') +``` + +输出词汇量大小。词汇包含 40 万个单词(令牌)和一个特殊的未知标记。 + +```{.python .input} +#@tab all +len(glove_6b50d) +``` + +我们可以在词汇中获得单词的索引,反之亦然。 + +```{.python .input} +#@tab all +glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367] +``` + +## 应用预训练的词向量 + +使用加载的 Glove 向量,我们将通过在以下单词相似性和类比任务中应用它们来演示它们的语义。 + +### 词相似性 + +与 :numref:`subsec_apply-word-embed` 类似,为了根据词矢量之间的余弦相似性为输入词找到语义上相似的词,我们实现了以下 `knn`($k$-最近邻)函数。 + +```{.python .input} +def knn(W, x, k): + # Add 1e-9 for numerical stability + cos = np.dot(W, x.reshape(-1,)) / ( + np.sqrt(np.sum(W * W, axis=1) + 1e-9) * np.sqrt((x * x).sum())) + topk = npx.topk(cos, k=k, ret_typ='indices') + return topk, [cos[int(i)] for i in topk] +``` + +```{.python .input} +#@tab pytorch +def knn(W, x, k): + # Add 1e-9 for numerical stability + cos = torch.mv(W, x.reshape(-1,)) / ( + torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) * + torch.sqrt((x * x).sum())) + _, topk = torch.topk(cos, k=k) + return topk, [cos[int(i)] for i in topk] +``` + +然后,我们使用 `TokenEmbedding` 实例 `embed` 实例 `embed` 中的预先训练的词向量来搜索类似的单词。 + +```{.python .input} +#@tab all +def get_similar_tokens(query_token, k, embed): + topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1) + for i, c in zip(topk[1:], cos[1:]): # Exclude the input word + print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}') +``` + +`glove_6b50d` 中的预训练单词矢量的词汇包含 40 万个单词和一个特殊的未知标记。不包括输入单词和未知标记,在这个词汇中,我们可以找到三个与单词 “芯片” 在语义上最相似的单词。 + +```{.python .input} +#@tab all +get_similar_tokens('chip', 3, glove_6b50d) +``` + +下面输出了与 “宝贝” 和 “美丽” 类似的词语。 + +```{.python .input} +#@tab all +get_similar_tokens('baby', 3, glove_6b50d) +``` + +```{.python .input} +#@tab all +get_similar_tokens('beautiful', 3, glove_6b50d) +``` + +### 单词类比 + +除了找到类似的单词之外,我们还可以将单词矢量应用于单词类比任务。例如,“男人”:“女人”። “儿子”:“女儿” 是一个词类比的形式:“男人” 是 “女人”,因为 “儿子” 就是 “女儿”。具体来说,“类比完成任务” 这个词可以定义为:对于单词类比 $a : b :: c : d$,前三个词 $a$、$b$ 和 $c$,找 $d$。用 $\text{vec}(w)$ 表示单词 $w$ 的矢量。为了完成这个比喻,我们将找到矢量与 $\text{vec}(c)+\text{vec}(b)-\text{vec}(a)$ 结果最相似的单词。 + +```{.python .input} +#@tab all +def get_analogy(token_a, token_b, token_c, embed): + vecs = embed[[token_a, token_b, token_c]] + x = vecs[1] - vecs[0] + vecs[2] + topk, cos = knn(embed.idx_to_vec, x, 1) + return embed.idx_to_token[int(topk[0])] # Remove unknown words +``` + +让我们使用加载的单词矢量来验证 “男-女” 类比。 + +```{.python .input} +#@tab all +get_analogy('man', 'woman', 'son', glove_6b50d) +``` + +以下是 “首都国” 类比:“北京”:“中国”። “东京”:“日本”。这演示了预训练的单词矢量中的语义。 + +```{.python .input} +#@tab all +get_analogy('beijing', 'china', 'tokyo', glove_6b50d) +``` + +对于 “形容词-超级形容词” 类比,如 “坏”:“最坏”። “大”:“最大”,我们可以看到,预训练的单词向量可能会捕获句法信息。 + +```{.python .input} +#@tab all +get_analogy('bad', 'worst', 'big', glove_6b50d) +``` + +为了在预先训练的单词矢量中显示捕获的过去时概念,我们可以使用 “现在十-过去时” 类比来测试语法:“do”: “do”። “Go”: “走了”。 + +```{.python .input} +#@tab all +get_analogy('do', 'did', 'go', glove_6b50d) +``` + +## 摘要 + +* 实际上,在大型语言上预训练的单词向量可以应用于下游自然语言处理任务。 +* 预训练的词向量可以应用于单词相似性和类比任务。 + +## 练习 + +1. 使用 `TokenEmbedding('wiki.en')` 测试 FastText 结果。 +1. 当词汇量极大时,我们怎样才能更快地找到类似的单词或完成单词类比? 
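+
+对于上面的第一道练习,可以从下面这个简短的草图出发(仅作演示:这里沿用本节定义的 `TokenEmbedding` 和 `get_similar_tokens`;变量名 `fasttext_wiki` 是我们为演示临时取的,并非库中的 API;“wiki.en” 文件较大,下载可能比较耗时):
+
+```{.python .input}
+#@tab all
+# 加载 300 维的预训练 fastText 向量 “wiki.en”,并复用上面的
+# `get_similar_tokens` 函数来查找与 “chip” 语义最相近的词
+fasttext_wiki = TokenEmbedding('wiki.en')
+get_similar_tokens('chip', 3, fasttext_wiki)
+```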
+ +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/387) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1336) +:end_tab: diff --git a/chapter_natural-language-processing-pretraining/similarity-analogy_origin.md b/chapter_natural-language-processing-pretraining/similarity-analogy_origin.md new file mode 100644 index 000000000..93a6cd1f7 --- /dev/null +++ b/chapter_natural-language-processing-pretraining/similarity-analogy_origin.md @@ -0,0 +1,294 @@ +# Word Similarity and Analogy +:label:`sec_synonyms` + +In :numref:`sec_word2vec_pretraining`, +we trained a word2vec model on a small dataset, +and applied it +to find semantically similar words +for an input word. +In practice, +word vectors that are pretrained +on large corpora can be +applied to downstream +natural language processing tasks, +which will be covered later +in :numref:`chap_nlp_app`. +To demonstrate +semantics of pretrained word vectors +from large corpora in a straightforward way, +let us apply them +in the word similarity and analogy tasks. + +```{.python .input} +from d2l import mxnet as d2l +from mxnet import np, npx +import os + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +from d2l import torch as d2l +import torch +from torch import nn +import os +``` + +## Loading Pretrained Word Vectors + +Below lists pretrained GloVe embeddings of dimension 50, 100, and 300, +which can be downloaded from the [GloVe website](https://nlp.stanford.edu/projects/glove/). +The pretrained fastText embeddings are available in multiple languages. +Here we consider one English version (300-dimensional "wiki.en") that can be downloaded from the +[fastText website](https://fasttext.cc/). + +```{.python .input} +#@tab all +#@save +d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip', + '0b8703943ccdb6eb788e6f091b8946e82231bc4d') + +#@save +d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip', + 'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a') + +#@save +d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip', + 'b5116e234e9eb9076672cfeabf5469f3eec904fa') + +#@save +d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip', + 'c1816da3821ae9f43899be655002f6c723e91b88') +``` + +To load these pretrained GloVe and fastText embeddings, we define the following `TokenEmbedding` class. 
+ +```{.python .input} +#@tab all +#@save +class TokenEmbedding: + """Token Embedding.""" + def __init__(self, embedding_name): + self.idx_to_token, self.idx_to_vec = self._load_embedding( + embedding_name) + self.unknown_idx = 0 + self.token_to_idx = {token: idx for idx, token in + enumerate(self.idx_to_token)} + + def _load_embedding(self, embedding_name): + idx_to_token, idx_to_vec = [''], [] + data_dir = d2l.download_extract(embedding_name) + # GloVe website: https://nlp.stanford.edu/projects/glove/ + # fastText website: https://fasttext.cc/ + with open(os.path.join(data_dir, 'vec.txt'), 'r') as f: + for line in f: + elems = line.rstrip().split(' ') + token, elems = elems[0], [float(elem) for elem in elems[1:]] + # Skip header information, such as the top row in fastText + if len(elems) > 1: + idx_to_token.append(token) + idx_to_vec.append(elems) + idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec + return idx_to_token, d2l.tensor(idx_to_vec) + + def __getitem__(self, tokens): + indices = [self.token_to_idx.get(token, self.unknown_idx) + for token in tokens] + vecs = self.idx_to_vec[d2l.tensor(indices)] + return vecs + + def __len__(self): + return len(self.idx_to_token) +``` + +Below we load the +50-dimensional GloVe embeddings +(pretrained on a Wikipedia subset). +When creating the `TokenEmbedding` instance, +the specified embedding file has to be downloaded if it +was not yet. + +```{.python .input} +#@tab all +glove_6b50d = TokenEmbedding('glove.6b.50d') +``` + +Output the vocabulary size. The vocabulary contains 400000 words (tokens) and a special unknown token. + +```{.python .input} +#@tab all +len(glove_6b50d) +``` + +We can get the index of a word in the vocabulary, and vice versa. + +```{.python .input} +#@tab all +glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367] +``` + +## Applying Pretrained Word Vectors + +Using the loaded GloVe vectors, +we will demonstrate their semantics +by applying them +in the following word similarity and analogy tasks. + + +### Word Similarity + +Similar to :numref:`subsec_apply-word-embed`, +in order to find semantically similar words +for an input word +based on cosine similarities between +word vectors, +we implement the following `knn` +($k$-nearest neighbors) function. + +```{.python .input} +def knn(W, x, k): + # Add 1e-9 for numerical stability + cos = np.dot(W, x.reshape(-1,)) / ( + np.sqrt(np.sum(W * W, axis=1) + 1e-9) * np.sqrt((x * x).sum())) + topk = npx.topk(cos, k=k, ret_typ='indices') + return topk, [cos[int(i)] for i in topk] +``` + +```{.python .input} +#@tab pytorch +def knn(W, x, k): + # Add 1e-9 for numerical stability + cos = torch.mv(W, x.reshape(-1,)) / ( + torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) * + torch.sqrt((x * x).sum())) + _, topk = torch.topk(cos, k=k) + return topk, [cos[int(i)] for i in topk] +``` + +Then, we +search for similar words +using the pretrained word vectors +from the `TokenEmbedding` instance `embed`. + +```{.python .input} +#@tab all +def get_similar_tokens(query_token, k, embed): + topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1) + for i, c in zip(topk[1:], cos[1:]): # Exclude the input word + print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}') +``` + +The vocabulary of the pretrained word vectors +in `glove_6b50d` contains 400000 words and a special unknown token. +Excluding the input word and unknown token, +among this vocabulary +let us find +three most semantically similar words +to word "chip". 
+ +```{.python .input} +#@tab all +get_similar_tokens('chip', 3, glove_6b50d) +``` + +Below outputs similar words +to "baby" and "beautiful". + +```{.python .input} +#@tab all +get_similar_tokens('baby', 3, glove_6b50d) +``` + +```{.python .input} +#@tab all +get_similar_tokens('beautiful', 3, glove_6b50d) +``` + +### Word Analogy + +Besides finding similar words, +we can also apply word vectors +to word analogy tasks. +For example, +“man”:“woman”::“son”:“daughter” +is the form of a word analogy: +“man” is to “woman” as “son” is to “daughter”. +Specifically, +the word analogy completion task +can be defined as: +for a word analogy +$a : b :: c : d$, given the first three words $a$, $b$ and $c$, find $d$. +Denote the vector of word $w$ by $\text{vec}(w)$. +To complete the analogy, +we will find the word +whose vector is most similar +to the result of $\text{vec}(c)+\text{vec}(b)-\text{vec}(a)$. + +```{.python .input} +#@tab all +def get_analogy(token_a, token_b, token_c, embed): + vecs = embed[[token_a, token_b, token_c]] + x = vecs[1] - vecs[0] + vecs[2] + topk, cos = knn(embed.idx_to_vec, x, 1) + return embed.idx_to_token[int(topk[0])] # Remove unknown words +``` + +Let us verify the "male-female" analogy using the loaded word vectors. + +```{.python .input} +#@tab all +get_analogy('man', 'woman', 'son', glove_6b50d) +``` + +Below completes a +“capital-country” analogy: +“beijing”:“china”::“tokyo”:“japan”. +This demonstrates +semantics in the pretrained word vectors. + +```{.python .input} +#@tab all +get_analogy('beijing', 'china', 'tokyo', glove_6b50d) +``` + +For the +“adjective-superlative adjective” analogy +such as +“bad”:“worst”::“big”:“biggest”, +we can see that the pretrained word vectors +may capture the syntactic information. + +```{.python .input} +#@tab all +get_analogy('bad', 'worst', 'big', glove_6b50d) +``` + +To show the captured notion +of past tense in the pretrained word vectors, +we can test the syntax using the +"present tense-past tense" analogy: “do”:“did”::“go”:“went”. + +```{.python .input} +#@tab all +get_analogy('do', 'did', 'go', glove_6b50d) +``` + +## Summary + +* In practice, word vectors that are pretrained on large corpora can be applied to downstream natural language processing tasks. +* Pretrained word vectors can be applied to the word similarity and analogy tasks. + + +## Exercises + +1. Test the fastText results using `TokenEmbedding('wiki.en')`. +1. When the vocabulary is extremely large, how can we find similar words or complete a word analogy faster? 
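+
+As a possible starting point for the second exercise, here is a minimal sketch (PyTorch only; the names `norm_vecs` and `fast_similar_tokens` are ours, not part of any library) that normalizes the embedding matrix once, so that each later query needs only one matrix-vector product and a top-$k$ lookup instead of recomputing all the row norms the way `knn` does:
+
+```{.python .input}
+#@tab pytorch
+import torch
+import torch.nn.functional as F
+
+# Precompute unit-length rows once; cosine similarity to a unit-length query
+# then reduces to a dot product
+norm_vecs = F.normalize(glove_6b50d.idx_to_vec, dim=1)
+
+def fast_similar_tokens(query_token, k, embed, norm_vecs):
+    x = F.normalize(embed[[query_token]].reshape(-1), dim=0)
+    cos = torch.mv(norm_vecs, x)
+    _, topk = torch.topk(cos, k=k + 1)
+    # Drop the query token itself, which (if it is in the vocabulary) ranks first
+    return [embed.idx_to_token[int(i)] for i in topk[1:]]
+
+fast_similar_tokens('chip', 3, glove_6b50d, norm_vecs)
+```
+
+For an extremely large vocabulary one would typically go further and use an approximate nearest-neighbor index; the sketch above only removes the redundant norm computations across repeated queries.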
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/387)
+:end_tab:
+
+:begin_tab:`pytorch`
+[Discussions](https://discuss.d2l.ai/t/1336)
+:end_tab:
diff --git a/chapter_natural-language-processing-pretraining/subword-embedding.md b/chapter_natural-language-processing-pretraining/subword-embedding.md
new file mode 100644
index 000000000..7f985fcf7
--- /dev/null
+++ b/chapter_natural-language-processing-pretraining/subword-embedding.md
@@ -0,0 +1,146 @@
+# 子词嵌入
+:label:`sec_fasttext`
+
+在英语中,“helps”、“helped” 和 “helping” 等单词都是同一个词 “help” 的变形形式。“dog” 和 “dogs” 之间的关系与 “cat” 和 “cats” 之间的关系相同,“boy” 和 “boyfriend” 之间的关系与 “girl” 和 “girlfriend” 之间的关系相同。在法语和西班牙语等其他语言中,许多动词有 40 多种变形形式,而在芬兰语中,一个名词可能有多达 15 种格。在语言学中,形态学研究单词的构成以及词与词之间的关系。但是,无论是在 word2vec 中还是在 GloVe 中,都没有探索单词的内部结构。
+
+## fastText 模型
+
+回想一下单词在 word2vec 中是如何表示的。在跳过图模型和连续词袋模型中,同一个单词的不同变形形式直接由不同的向量表示,没有共享参数。为了使用形态学信息,*fastText* 模型提出了一种*子词嵌入*方法,其中子词是一个字符 $n$ 元语法($n$-gram):cite:`Bojanowski.Grave.Joulin.ea.2017`。fastText 没有学习单词级别的向量表示,而是可以被看作子词级别的跳过图模型,其中每个*中心词*由其子词向量的总和表示。
+
+让我们以单词 “where” 为例,来说明如何为 fastText 中的每个中心词获取子词。首先,在单词的开头和结尾添加特殊字符 “<” 和 “>”,以便将前缀和后缀与其他子词区分开。然后,从单词中提取字符 $n$ 元语法。例如,当 $n=3$ 时,我们将获得长度为 3 的所有子词:“<wh”、“whe”、“her”、“ere”、“re>”,以及特殊子词 “<where>”。
+
+在 fastText 中,对于任意一个单词 $w$,用 $\mathcal{G}_w$ 表示其长度介于 3 到 6 之间的所有子词与其特殊子词的并集。词表是所有单词的子词的并集。令 $\mathbf{z}_g$ 表示词典中子词 $g$ 的向量,则单词 $w$ 作为跳过图模型中的中心词时,其向量 $\mathbf{v}_w$ 是其子词向量的总和:
+
+$$\mathbf{v}_w = \sum_{g\in\mathcal{G}_w} \mathbf{z}_g.$$
+
+fastText 的其余部分与跳过图模型相同。与跳过图模型相比,fastText 中的词表更大,导致模型参数更多。此外,为了计算一个单词的表示,必须对其所有子词向量求和,这导致更高的计算复杂度。但是,由于结构相似的单词之间共享来自子词的参数,稀有单词甚至是词表外的单词也许能在 fastText 中获得更好的向量表示。
+
+## 字节对编码
+:label:`subsec_Byte_Pair_Encoding`
+
+在 fastText 中,所有提取的子词都必须是指定的长度,例如 $3$ 到 $6$,因此词表大小无法预先定义。为了在固定大小的词表中允许可变长度的子词,我们可以应用一种名为*字节对编码*(byte pair encoding,BPE)的压缩算法来提取子词 :cite:`Sennrich.Haddow.Birch.2015`。
+
+字节对编码对训练数据集进行统计分析,以发现单词内的公共符号,例如任意长度的连续字符。从长度为 1 的符号开始,字节对编码迭代地合并最频繁的连续符号对,以产生新的、更长的符号。请注意,出于效率考虑,不考虑跨越单词边界的符号对。最后,我们可以使用这样的符号作为子词来切分单词。字节对编码及其变体已被用于流行的自然语言处理预训练模型(例如 GPT-2 :cite:`Radford.Wu.Child.ea.2019` 和 RoBERTa :cite:`Liu.Ott.Goyal.ea.2019`)的输入表示中。在下面,我们将说明字节对编码的工作原理。
+
+首先,我们将符号词表初始化为所有英文小写字符、一个特殊的词尾符号 `'_'` 和一个特殊的未知符号 `'[UNK]'`。
+
+```{.python .input}
+#@tab all
+import collections
+
+symbols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
+           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
+           '_', '[UNK]']
+```
+
+由于我们不考虑跨越单词边界的符号对,因此我们只需要一个将单词映射到其在数据集中出现频率(出现次数)的字典 `raw_token_freqs`。请注意,每个单词都附加了特殊符号 `'_'`,以便我们可以轻松地从输出符号序列(例如 “a_ tall er_ man”)中恢复单词序列(例如 “a taller man”)。由于合并过程从只包含单个字符和特殊符号的词表开始,因此在每个单词(字典 `token_freqs` 的键)内的每对连续字符之间都插入了空格。换句话说,空格是单词内符号之间的分隔符。
+
+```{.python .input}
+#@tab all
+raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4}
+token_freqs = {}
+for token, freq in raw_token_freqs.items():
+    token_freqs[' '.join(list(token))] = raw_token_freqs[token]
+token_freqs
+```
+
+我们定义了以下 `get_max_freq_pair` 函数,它返回单词内最频繁的连续符号对,其中单词来自输入字典 `token_freqs` 的键。
+
+```{.python .input}
+#@tab all
+def get_max_freq_pair(token_freqs):
+    pairs = collections.defaultdict(int)
+    for token, freq in token_freqs.items():
+        symbols = token.split()
+        for i in range(len(symbols) - 1):
+            # Key of `pairs` is a tuple of two consecutive symbols
+            pairs[symbols[i], symbols[i + 1]] += freq
+    return max(pairs, key=pairs.get)  # Key of `pairs` with the max value
+```
+
+作为一种基于连续符号频率的贪心方法,字节对编码将使用以下 `merge_symbols` 函数来合并最频繁的连续符号对,以产生新的符号。
+
+```{.python .input}
+#@tab all
+def merge_symbols(max_freq_pair, token_freqs, symbols):
+    symbols.append(''.join(max_freq_pair))
+    new_token_freqs = dict()
+    for token, freq in token_freqs.items():
+        new_token = token.replace(' '.join(max_freq_pair),
+                                  ''.join(max_freq_pair))
+        new_token_freqs[new_token] = token_freqs[token]
+    return new_token_freqs
+```
+
+现在我们对字典 `token_freqs` 的键迭代地执行字节对编码算法。在第一次迭代中,最频繁的连续符号对是 `'t'` 和 `'a'`,因此字节对编码将它们合并以产生一个新符号 `'ta'`。在第二次迭代中,字节对编码继续合并 `'ta'` 和 `'l'`,从而产生另一个新符号 `'tal'`。
+
+```{.python .input}
+#@tab all
+num_merges = 10
+for i in range(num_merges):
+    max_freq_pair = get_max_freq_pair(token_freqs)
+    token_freqs = merge_symbols(max_freq_pair, token_freqs, symbols)
+    print(f'merge #{i + 1}:', max_freq_pair)
+```
+
+在字节对编码的 10 次迭代之后,我们可以看到列表 `symbols` 现在多包含了 10 个从其他符号迭代合并而来的符号。
+
+```{.python .input}
+#@tab all
+print(symbols)
+```
+
+对于在字典 `raw_token_freqs` 的键中指定的同一数据集,作为字节对编码算法的结果,数据集中的每个单词现在都被子词 “fast_”、“fast”、“er_”、“tall_” 和 “tall” 切分。例如,单词 “faster_” 和 “taller_” 分别被切分为 “fast er_” 和 “tall er_”。
+
+```{.python .input}
+#@tab all
+print(list(token_freqs.keys()))
+```
+
+请注意,字节对编码的结果取决于所使用的数据集。我们还可以使用从一个数据集中学到的子词来切分另一个数据集的单词。作为一种贪心方法,以下 `segment_BPE` 函数试图将单词切分成输入参数 `symbols` 中尽可能长的子词。
+
+```{.python .input}
+#@tab all
+def segment_BPE(tokens, symbols):
+    outputs = []
+    for token in tokens:
+        start, end = 0, len(token)
+        cur_output = []
+        # Segment token with the longest possible subwords from symbols
+        while start < len(token) and start < end:
+            if token[start: end] in symbols:
+                cur_output.append(token[start: end])
+                start = end
+                end = len(token)
+            else:
+                end -= 1
+        if start < len(token):
+            cur_output.append('[UNK]')
+        outputs.append(' '.join(cur_output))
+    return outputs
+```
+
+在下面,我们使用从上述数据集中学到的列表 `symbols` 中的子词,来切分表示另一个数据集的 `tokens`。
+
+```{.python .input}
+#@tab all
+tokens = ['tallest_', 'fatter_']
+print(segment_BPE(tokens, symbols))
+```
+
+## 摘要
+
+* fastText 模型提出了一种子词嵌入方法。它基于 word2vec 中的跳过图模型,将中心词表示为其子词向量的总和。
+* 字节对编码对训练数据集进行统计分析,以发现单词内的公共符号。作为一种贪心方法,字节对编码迭代地合并最频繁的连续符号对。
+* 子词嵌入可能会提高稀有单词和词表外单词表示的质量。
+
+## 练习
+
+1. 例如,英语中大约有 $3\times 10^8$ 种可能的 $6$ 元语法。子词太多会有什么问题?如何解决这个问题?提示:请参阅 fastText 论文 :cite:`Bojanowski.Grave.Joulin.ea.2017` 第 3.2 节的末尾。
+1. 如何基于连续词袋模型设计子词嵌入模型?
+1. 要获得大小为 $m$ 的词表,当初始符号词表大小为 $n$ 时,需要多少次合并操作?
+1. 如何扩展字节对编码的思想来提取短语?
+
+:begin_tab:`mxnet`
+[Discussions](https://discuss.d2l.ai/t/386)
+:end_tab:
diff --git a/chapter_natural-language-processing-pretraining/subword-embedding_origin.md b/chapter_natural-language-processing-pretraining/subword-embedding_origin.md
new file mode 100644
index 000000000..dbfc36648
--- /dev/null
+++ b/chapter_natural-language-processing-pretraining/subword-embedding_origin.md
@@ -0,0 +1,244 @@
+# Subword Embedding
+:label:`sec_fasttext`
+
+In English,
+words such as
+"helps", "helped", and "helping" are
+inflected forms of the same word "help".
+The relationship
+between "dog" and "dogs"
+is the same as
+that between "cat" and "cats",
+and
+the relationship
+between "boy" and "boyfriend"
+is the same as
+that between "girl" and "girlfriend".
+In other languages
+such as French and Spanish,
+many verbs have over 40 inflected forms,
+while in Finnish,
+a noun may have up to 15 cases.
+In linguistics,
+morphology studies word formation and word relationships.
+However,
+the internal structure of words
+was neither explored in word2vec
+nor in GloVe.
+
+## The fastText Model
+
+Recall how words are represented in word2vec.
+In both the skip-gram model
+and the continuous bag-of-words model,
+different inflected forms of the same word
+are directly represented by different vectors
+without shared parameters.
+To use morphological information, +the *fastText* model +proposed a *subword embedding* approach, +where a subword is a character $n$-gram :cite:`Bojanowski.Grave.Joulin.ea.2017`. +Instead of learning word-level vector representations, +fastText can be considered as +the subword-level skip-gram, +where each *center word* is represented by the sum of +its subword vectors. + +Let us illustrate how to obtain +subwords for each center word in fastText +using the word "where". +First, add special characters “<” and “>” +at the beginning and end of the word to distinguish prefixes and suffixes from other subwords. +Then, extract character $n$-grams from the word. +For example, when $n=3$, +we obtain all subwords of length 3: "<wh", "whe", "her", "ere", "re>", and the special subword "<where>". + + +In fastText, for any word $w$, +denote by $\mathcal{G}_w$ +the union of all its subwords of length between 3 and 6 +and its special subword. +The vocabulary +is the union of the subwords of all words. +Letting $\mathbf{z}_g$ +be the vector of subword $g$ in the dictionary, +the vector $\mathbf{v}_w$ for +word $w$ as a center word +in the skip-gram model +is the sum of its subword vectors: + +$$\mathbf{v}_w = \sum_{g\in\mathcal{G}_w} \mathbf{z}_g.$$ + +The rest of fastText is the same as the skip-gram model. Compared with the skip-gram model, +the vocabulary in fastText is larger, +resulting in more model parameters. +Besides, +to calculate the representation of a word, +all its subword vectors +have to be summed, +leading to higher computational complexity. +However, +thanks to shared parameters from subwords among words with similar structures, +rare words and even out-of-vocabulary words +may obtain better vector representations in fastText. + + + +## Byte Pair Encoding +:label:`subsec_Byte_Pair_Encoding` + +In fastText, all the extracted subwords have to be of the specified lengths, such as $3$ to $6$, thus the vocabulary size cannot be predefined. +To allow for variable-length subwords in a fixed-size vocabulary, +we can apply a compression algorithm +called *byte pair encoding* (BPE) to extract subwords :cite:`Sennrich.Haddow.Birch.2015`. + +Byte pair encoding performs a statistical analysis of the training dataset to discover common symbols within a word, +such as consecutive characters of arbitrary length. +Starting from symbols of length 1, +byte pair encoding iteratively merges the most frequent pair of consecutive symbols to produce new longer symbols. +Note that for efficiency, pairs crossing word boundaries are not considered. +In the end, we can use such symbols as subwords to segment words. +Byte pair encoding and its variants has been used for input representations in popular natural language processing pretraining models such as GPT-2 :cite:`Radford.Wu.Child.ea.2019` and RoBERTa :cite:`Liu.Ott.Goyal.ea.2019`. +In the following, we will illustrate how byte pair encoding works. + +First, we initialize the vocabulary of symbols as all the English lowercase characters, a special end-of-word symbol `'_'`, and a special unknown symbol `'[UNK]'`. + +```{.python .input} +#@tab all +import collections + +symbols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', + 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', + '_', '[UNK]'] +``` + +Since we do not consider symbol pairs that cross boundaries of words, +we only need a dictionary `raw_token_freqs` that maps words to their frequencies (number of occurrences) +in a dataset. 
+Note that the special symbol `'_'` is appended to each word so that +we can easily recover a word sequence (e.g., "a taller man") +from a sequence of output symbols ( e.g., "a_ tall er_ man"). +Since we start the merging process from a vocabulary of only single characters and special symbols, space is inserted between every pair of consecutive characters within each word (keys of the dictionary `token_freqs`). +In other words, space is the delimiter between symbols within a word. + +```{.python .input} +#@tab all +raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4} +token_freqs = {} +for token, freq in raw_token_freqs.items(): + token_freqs[' '.join(list(token))] = raw_token_freqs[token] +token_freqs +``` + +We define the following `get_max_freq_pair` function that +returns the most frequent pair of consecutive symbols within a word, +where words come from keys of the input dictionary `token_freqs`. + +```{.python .input} +#@tab all +def get_max_freq_pair(token_freqs): + pairs = collections.defaultdict(int) + for token, freq in token_freqs.items(): + symbols = token.split() + for i in range(len(symbols) - 1): + # Key of `pairs` is a tuple of two consecutive symbols + pairs[symbols[i], symbols[i + 1]] += freq + return max(pairs, key=pairs.get) # Key of `pairs` with the max value +``` + +As a greedy approach based on frequency of consecutive symbols, +byte pair encoding will use the following `merge_symbols` function to merge the most frequent pair of consecutive symbols to produce new symbols. + +```{.python .input} +#@tab all +def merge_symbols(max_freq_pair, token_freqs, symbols): + symbols.append(''.join(max_freq_pair)) + new_token_freqs = dict() + for token, freq in token_freqs.items(): + new_token = token.replace(' '.join(max_freq_pair), + ''.join(max_freq_pair)) + new_token_freqs[new_token] = token_freqs[token] + return new_token_freqs +``` + +Now we iteratively perform the byte pair encoding algorithm over the keys of the dictionary `token_freqs`. In the first iteration, the most frequent pair of consecutive symbols are `'t'` and `'a'`, thus byte pair encoding merges them to produce a new symbol `'ta'`. In the second iteration, byte pair encoding continues to merge `'ta'` and `'l'` to result in another new symbol `'tal'`. + +```{.python .input} +#@tab all +num_merges = 10 +for i in range(num_merges): + max_freq_pair = get_max_freq_pair(token_freqs) + token_freqs = merge_symbols(max_freq_pair, token_freqs, symbols) + print(f'merge #{i + 1}:', max_freq_pair) +``` + +After 10 iterations of byte pair encoding, we can see that list `symbols` now contains 10 more symbols that are iteratively merged from other symbols. + +```{.python .input} +#@tab all +print(symbols) +``` + +For the same dataset specified in the keys of the dictionary `raw_token_freqs`, +each word in the dataset is now segmented by subwords "fast_", "fast", "er_", "tall_", and "tall" +as a result of the byte pair encoding algorithm. +For instance, words "faster_" and "taller_" are segmented as "fast er_" and "tall er_", respectively. + +```{.python .input} +#@tab all +print(list(token_freqs.keys())) +``` + +Note that the result of byte pair encoding depends on the dataset being used. +We can also use the subwords learned from one dataset +to segment words of another dataset. +As a greedy approach, the following `segment_BPE` function tries to break words into the longest possible subwords from the input argument `symbols`. 
+ +```{.python .input} +#@tab all +def segment_BPE(tokens, symbols): + outputs = [] + for token in tokens: + start, end = 0, len(token) + cur_output = [] + # Segment token with the longest possible subwords from symbols + while start < len(token) and start < end: + if token[start: end] in symbols: + cur_output.append(token[start: end]) + start = end + end = len(token) + else: + end -= 1 + if start < len(token): + cur_output.append('[UNK]') + outputs.append(' '.join(cur_output)) + return outputs +``` + +In the following, we use the subwords in list `symbols`, which is learned from the aforementioned dataset, +to segment `tokens` that represent another dataset. + +```{.python .input} +#@tab all +tokens = ['tallest_', 'fatter_'] +print(segment_BPE(tokens, symbols)) +``` + +## Summary + +* The fastText model proposes a subword embedding approach. Based on the skip-gram model in word2vec, it represents a center word as the sum of its subword vectors. +* Byte pair encoding performs a statistical analysis of the training dataset to discover common symbols within a word. As a greedy approach, byte pair encoding iteratively merges the most frequent pair of consecutive symbols. +* Subword embedding may improve the quality of representations of rare words and out-of-dictionary words. + +## Exercises + +1. As an example, there are about $3\times 10^8$ possible $6$-grams in English. What is the issue when there are too many subwords? How to address the issue? Hint: refer to the end of Section 3.2 of the fastText paper :cite:`Bojanowski.Grave.Joulin.ea.2017`. +1. How to design a subword embedding model based on the continuous bag-of-words model? +1. To get a vocabulary of size $m$, how many merging operations are needed when the initial symbol vocabulary size is $n$? +1. How to extend the idea of byte pair encoding to extract phrases? + + + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/386) +:end_tab:
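+
+As a small, hedged supplement to the fastText discussion above: the following sketch (plain Python; the helper name `get_subwords` and its default length range are ours, not part of any fastText API) extracts the character $n$-grams and the special whole-word subword described for the word "where":
+
+```{.python .input}
+#@tab all
+def get_subwords(word, min_n=3, max_n=6):
+    # Wrap the word with the special boundary characters '<' and '>'
+    token = '<' + word + '>'
+    subwords = {token}  # fastText also keeps the special whole-word subword
+    for n in range(min_n, max_n + 1):
+        for i in range(len(token) - n + 1):
+            subwords.add(token[i:i + n])
+    return subwords
+
+# With n fixed to 3 this reproduces the example in the text:
+# {'<wh', 'whe', 'her', 'ere', 're>', '<where>'}
+get_subwords('where', min_n=3, max_n=3)
+```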