- Overview of the model
- Sequence of the steps to perform
- Structure of the code
- Before you start
- Download and prepare the dataset
- BERT input function
- How to run
- Appendix
Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only transformer-based model designed for natural language understanding.
This directory contains implementations of the BERT model. It uses a stack of transformer blocks with multi-head attention followed by a multi-layer perceptron feed-forward network.
We support removing the next-sentence-prediction (NSP) loss from BERT training and pre-training with only the masked-language-modeling (MLM) loss.
The training pipeline has 2 phases. We first train with a maximum sequence length of 128 and then train with a maximum sequence length of 512. More details of the model can be found in the appendix.
An overview of the model diagram is here:
We also support the RoBERTa model, which is very similar to BERT in architectural design. To improve on BERT's results, RoBERTa changes the objective function (removing NSP), the batch sizes, the sequence lengths, and the masking pattern (dynamic instead of static). The difference between dynamic and static masking is discussed below.
This document walks you through an example showing the steps to run a BERT pre-training on the Cerebras Wafer-Scale Cluster (and on GPUs) using the code in this repo.
Note: You can use any subset of the steps described in this example. For example, if you already downloaded your preferred dataset and created the CSV files, then you can skip the section Preprocess data.
The following block diagram shows a high-level view of the sequence of steps you will perform in this example.
- `configs/`: YAML configuration files.
- `fine_tuning/`: Code for fine-tuning the BERT model.
- `data/nlp/bert`: Input pipeline implementation based on the OpenWebText dataset. This directory also contains the scripts you can use to download and prepare the OpenWebText dataset. Vocab files are located in `models/vocab`.
- `model.py`: Provides a common wrapper for all models under the class `BertForPreTrainingModel`, which interfaces with model-specific code. In this repo the model-specific code, i.e., the model architecture, is in `bert_pretrain_models.py::BertPretrainModel`, and the fine-tuning model architectures are in `bert_finetune_models.py`. This wrapper provides a common interface for handling the function call of the model with its specific data format. It also provides a common interface for constructing various models from the same format of configuration files in `configs/`.
- `data.py`: The entry point to the data input pipeline code.
- `run.py`: Training script. Performs training and validation.
- `utils.py`: Miscellaneous helper functions.
This example walk-through consists of two main steps:
- Prepare the dataset.
- Perform the pre-training.
This example follows the standard practice of 2-phase pre-training for BERT models. In the 2-phase pre-training, the model is:

- First pre-trained with the maximum sequence length (MSL) of 128 for 90% of the steps.
- Then pre-trained with the MSL of 512 for the final 10% of the steps.
CSV files for each phase: You will need to create different CSV files for each of these 2 phases of pre-training; details are provided below.
The scripts for downloading and preprocessing the OpenWebText dataset (https://skylion007.github.io/OpenWebTextCorpus/) are located in this repo.
Start by downloading the OWT dataset by accessing the following link from a browser:

https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx

and manually download the `tar.xz` file from that location to your preferred local directory.
NOTE: Currently, a server-side issue with the OWT site prevents using the download_and_extract.sh shell script below to download this tar file. We will update the script when this issue is resolved.
In the original paper, RoBERTa is trained with OpenWebText and the following additional datasets:
NOTE: In order to replicate our results, please use the dataset provided by us. It is a combination of a few of the datasets above but not all. If you want to replicate the results of the original paper, please go to the links and download the required datasets from there.
To extract the above-downloaded files, run the download_and_extract.sh shell script:
bash download_and_extract.sh
NOTE: The download_and_extract.sh script may take a while to complete, as it unpacks 40GB of data (8,013,770 documents).
Upon completion, the script will produce an `openwebtext` folder in the same folder where the tar file is located. The `openwebtext` folder will have multiple subfolders, each containing a collection of `*.txt` files of raw text data (one document per `.txt` file).
NOTE: For the other datasets that are used with RoBERTa, you can download each dataset from its link and extract it to your preferred location with the corresponding extraction command.
In the next step, you will create two subsets of extracted txt files, one for training and the second for validation. These two subsets are then used to create CSV files that will be used for pre-training.
IMPORTANT: The training and validation subsets must contain mutually exclusive .txt files.
Proceed as follows:
Define metadata files that contain paths to subsets of documents in the `openwebtext` folder to be used for training or validation.

For training, in this tutorial we use a subset of 512,000 documents. The associated metadata file can be found in metadata/train_512k.txt.

For validation, we choose 5,000 documents that are not in the training set. The metadata file for validation can be found in metadata/val_files.txt.
NOTE: You are free to create your own metadata files that define your train and validation data subsets, with your preferred content and sizes. You can also create a data subset for testing.
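For reference, here is a minimal sketch of how you could generate your own metadata files by listing the extracted `.txt` files and splitting them into disjoint training and validation subsets. The script name, paths, split size, and the assumption that a metadata file simply lists one document path per line are illustrative, not part of this repo:

```python
# make_metadata.py -- illustrative sketch, not part of the repo.
# Lists the extracted .txt files under the openwebtext folder and writes
# disjoint train/validation metadata files (one relative path per line).
import random
from pathlib import Path

OWT_DIR = Path("/path/to/raw/data/openwebtext")  # assumption: adjust to your location
NUM_VAL = 5000                                   # example validation subset size

txt_files = sorted(p.relative_to(OWT_DIR) for p in OWT_DIR.rglob("*.txt"))
random.seed(0)
random.shuffle(txt_files)

val_files, train_files = txt_files[:NUM_VAL], txt_files[NUM_VAL:]

Path("my_val_files.txt").write_text("\n".join(map(str, val_files)) + "\n")
Path("my_train_files.txt").write_text("\n".join(map(str, train_files)) + "\n")
```

The resulting files can then be passed to `create_csv.py` via `--metadata_files`, with `--input_files_prefix` pointing at the `openwebtext` folder, since the paths written above are relative.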
Next, using the metadata file that defines a data subset (for training or for validation), create CSV files containing masked sequences and labels derived from the data subset.
The preprocessing consists of creating CSV files containing sequences and labels.
Prerequisites
If you do not have spaCy, the natural language processing (NLP) library, then install it with the following commands:
pip install spacy
python -m spacy download en
create_csv.py
To create CSV files containing sequences and labels derived from the data subset, you will use the Python utility create_csv.py
located in the data_preparation/nlp/bert directory.
create_csv_mlm_only.py
In addition, the create_csv_mlm_only.py script can be used to create data without the NSP labels.
Note that create_csv.py and create_csv_mlm_only.py are intended to work with dynamic masking, so they do not create masks during preprocessing. Dynamic masks are created on the fly in the `BertCSVDynamicMaskDataProcessor`.
create_csv_static_masking.py
The create_csv_static_masking.py script can be used to create a dataset with static masking, which performs masking only once, during data preprocessing, so the same input masks are applied on each epoch.
create_csv_mlm_only_static_masking.py
The create_csv_mlm_only_static_masking.py script can be used to create statically masked data without NSP labels.
Note that:

- `create_csv.py` or `create_csv_mlm_only.py` creates CSV files to be used with `BertCSVDynamicMaskDataProcessor`.
- `create_csv_static_masking.py` or `create_csv_mlm_only_static_masking.py` creates CSV files to be used with `BertCSVDataProcessor`.
- `BertCSVDataProcessor` loads the masks created during data preprocessing, so the same input masks are applied on each epoch.
- `BertCSVDynamicMaskDataProcessor` creates the masks on the fly every time the data is loaded, so the input masks are different on each epoch.
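To make the difference concrete, here is a minimal, self-contained sketch of an MLM masking step (not the repo's implementation; the helper name, special-token ID, and probabilities are illustrative). With static masking this function runs once while the CSV files are written; with dynamic masking a processor such as `BertCSVDynamicMaskDataProcessor` re-applies it every time a sample is loaded, so each epoch sees different masks:

```python
# Illustrative MLM masking sketch; not the repo's implementation.
import random

def apply_mlm_mask(token_ids, vocab_size, mask_id=103, mlm_prob=0.15, rng=None):
    """Return (masked_ids, mlm_positions, mlm_labels) using the standard
    BERT 80/10/10 rule for tokens selected with probability mlm_prob."""
    rng = rng or random.Random()
    masked_ids, positions, labels = list(token_ids), [], []
    for pos, tok in enumerate(token_ids):
        if rng.random() >= mlm_prob:
            continue  # token not selected for prediction
        positions.append(pos)
        labels.append(tok)
        r = rng.random()
        if r < 0.8:                    # 80%: replace with the [MASK] token
            masked_ids[pos] = mask_id
        elif r < 0.9:                  # 10%: replace with a random token
            masked_ids[pos] = rng.randrange(vocab_size)
        # remaining 10%: keep the original token
    return masked_ids, positions, labels

tokens = [2023, 2003, 1037, 7099, 6251, 2005, 23325, 2075]

# "Static" masking: mask once during preprocessing, reuse every epoch.
static_sample = apply_mlm_mask(tokens, vocab_size=30522, rng=random.Random(0))

# "Dynamic" masking: re-mask at load time, so each epoch differs.
for epoch in range(3):
    print(epoch, apply_mlm_mask(tokens, vocab_size=30522))
```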
Refer to README.md for more details.
Syntax
The command-line syntax to run the Python utility `create_csv.py` is as follows:
python create_csv.py --metadata_files /path/to/metadata_file.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file /path/to/vocab.txt --do_lower_case
where:
- `metadata_file.txt` is a metadata file containing a list of paths to documents, and
- `/path/to/vocab.txt` contains a vocabulary file to map WordPieces to word IDs.
For example, you can use the supplied `metadata/train_512k.txt` as an input to generate a training set based on 512,000 documents. Sample vocabularies can be found in the `models/vocab` folder.
Some arguments and their usage are listed below:
- `--metadata_files`: path to a text file containing a list of file names corresponding to the raw input documents to be processed and stored; can handle multiple metadata files separated by spaces (default: `None`).
- `--input_files_prefix`: prefix to be added to the paths of the input files. For example, it can be a directory where raw data is stored if the paths are relative.
- `--vocab_file`: path to the vocabulary file (default: `None`).
- `--split_num`: number of input files to read at a given time for processing (default: `1000`).
- `--do_lower_case`: pass this flag to lower-case the input text; should be `True` for uncased models and `False` for cased models (default: `False`).
- `--max_seq_length`: maximum sequence length (default: `128`).
- `--masked_lm_prob`: masked LM probability (default: `0.15`).
- `--max_predictions_per_seq`: maximum number of masked LM predictions per sequence (default: `20`).
- `--output_dir`: directory where CSV files will be stored (default: `./csvfiles/`).
- `--seed`: random seed (default: `0`).
For more details, run the command `python create_csv.py --help` (or `python create_csv_static_masking.py --help` if creating statically masked data).
For the 2-phase BERT pre-training that we are following in this tutorial, you need to generate the following datasets:

- A training dataset for the first phase, using sequences with a maximum length of 128.
- A second training dataset for the second phase, using sequences with a maximum length of 512.
If you want to run validation, then:
- Two additional validation datasets, one for each phase.
In total, to run training and validation for both pre-training phases, you will need four datasets: a training and a validation dataset for phase 1 with MSL 128, and a training and a validation dataset for phase 2 with MSL 512.
Proceed by running the following commands:
Generate training CSV files for phase 1 (MSL 128):
python create_csv.py --metadata_files metadata/train_512k.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir train_512k_uncased_msl128 --do_lower_case --max_seq_length 128 --max_predictions_per_seq 20
Generate validation CSV files for phase 1 (MSL 128):
python create_csv.py --metadata_files metadata/val_files.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir val_uncased_msl128 --do_lower_case --max_seq_length 128 --max_predictions_per_seq 20
Generate training CSV files for phase 2 (MSL 512):
python create_csv.py --metadata_files metadata/train_512k.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir train_512k_uncased_msl512 --do_lower_case --max_seq_length 512 --max_predictions_per_seq 80
Generate validation CSV files for phase 2 (MSL 512):
python create_csv.py --metadata_files metadata/val_files.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir val_uncased_msl512 --do_lower_case --max_seq_length 512 --max_predictions_per_seq 80
The above-created CSV files are then used by the `BertCSVDynamicMaskDataProcessor` class to produce inputs to the model.
If you want to use static masking instead, generate the CSV files with `create_csv_static_masking.py` and specify `BertCSVDataProcessor` as the `data_processor` under `train_input` in the config file.
If you want to use your own data loader with this example code, then this section describes the input data format expected by the `BertForPreTrainingModel` class defined in model.py.
When you create your own custom BERT input function, you must ensure that your BERT input function produces a features dictionary as described in this section.
The features dictionary has the following key/values:
- `input_ids`: Input token IDs, padded with `0` to `max_sequence_length`.
  - Shape: [`batch_size`, `max_sequence_length`]
  - Type: `torch.int32`
- `attention_mask`: Mask for padded positions. Has values `0` on the padded positions and `1` elsewhere.
  - Shape: [`batch_size`, `max_sequence_length`]
  - Type: `torch.int32`
- `token_type_ids`: Segment IDs. Has values `0` on the positions corresponding to the first segment and `1` on the positions corresponding to the second segment.
  - Shape: [`batch_size`, `max_sequence_length`]
  - Type: `torch.int32`
- `masked_lm_positions`: Positions of masked tokens in the `input_ids` tensor, padded with `0` to `max_predictions_per_seq`.
  - Shape: [`batch_size`, `max_predictions_per_seq`]
  - Type: `torch.int32`
- `masked_lm_weights`: Mask for `masked_lm_positions`. Has value `batch_size / num_masked_tokens_in_batch` on the positions corresponding to actually masked tokens in the given sample, and `0.0` elsewhere. See the MLM Loss Scaling section for more detail.
  - Shape: [`batch_size`, `max_predictions_per_seq`]
  - Type: `torch.float32`
- `labels`: Labels for computing the masked language modeling loss. Tokens with indices set to `-100` are ignored (masked out of the loss).
  - Shape: [`batch_size`, `max_sequence_length`]
  - Type: `torch.int32`
- `next_sentence_label`: Labels for computing the next sentence prediction (classification) loss. The input should be a sequence pair, where `0` indicates that sequence B is a continuation of sequence A and `1` indicates that sequence B is a random sequence.
  - Shape: [`batch_size`,]
  - Type: `torch.int32`
Note: You can omit the features `token_type_ids` and `next_sentence_label` if you are pre-training for MLM only (in the YAML file, set `disable_nsp` to `True` under the model params).
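As a concrete reference for the format above, below is a minimal sketch of an input function that yields a features dictionary with the expected keys, shapes, and dtypes, filled with random data. The function name and sizes are illustrative assumptions; a real data loader would derive these tensors from tokenized, paired, and masked text:

```python
# Illustrative sketch of a custom BERT input function (random data only).
import torch

def random_bert_pretraining_batch(
    batch_size=4, max_sequence_length=128, max_predictions_per_seq=20, vocab_size=30522
):
    # Pretend every prediction slot in the batch is a real masked token.
    num_masked = batch_size * max_predictions_per_seq
    features = {
        "input_ids": torch.randint(
            0, vocab_size, (batch_size, max_sequence_length), dtype=torch.int32
        ),
        # 1 for real tokens, 0 for padding (all real here).
        "attention_mask": torch.ones(batch_size, max_sequence_length, dtype=torch.int32),
        # 0 for segment A, 1 for segment B.
        "token_type_ids": torch.zeros(batch_size, max_sequence_length, dtype=torch.int32),
        # Positions of the masked tokens, padded with 0.
        "masked_lm_positions": torch.randint(
            0, max_sequence_length, (batch_size, max_predictions_per_seq), dtype=torch.int32
        ),
        # batch_size / num_masked_tokens_in_batch on real masks, 0.0 on padding.
        "masked_lm_weights": torch.full(
            (batch_size, max_predictions_per_seq), batch_size / num_masked, dtype=torch.float32
        ),
        # -100 everywhere a position should not contribute to the MLM loss;
        # a real loader would place the original token IDs at the masked positions.
        "labels": torch.full((batch_size, max_sequence_length), -100, dtype=torch.int32),
        # 0: sentence B follows A, 1: sentence B is random.
        "next_sentence_label": torch.randint(0, 2, (batch_size,), dtype=torch.int32),
    }
    return features

batch = random_bert_pretraining_batch()
for name, tensor in batch.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```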
Parameter settings in YAML config file: The config YAML files are located in the configs directory. Before starting a pre-training run, make sure that in the YAML config file you are using:
- The `train_input.data_dir` parameter points to the correct dataset, and
- The `train_input.max_sequence_length` parameter corresponds to the sequence length `--max_seq_length` passed to `create_csv.py` or `create_csv_mlm_only.py`.
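As an optional convenience, a short script like the sketch below can sanity-check these two settings before launching a run. It assumes the config uses the `train_input.data_dir` and `train_input.max_sequence_length` keys mentioned above and that PyYAML is installed; the script name is hypothetical:

```python
# check_config.py -- illustrative sanity check, not part of the repo.
import sys
from pathlib import Path

import yaml  # PyYAML

config_path, expected_msl = sys.argv[1], int(sys.argv[2])
params = yaml.safe_load(Path(config_path).read_text())

train_input = params["train_input"]
data_dir = Path(train_input["data_dir"])
msl = train_input["max_sequence_length"]

assert data_dir.is_dir(), f"data_dir does not exist: {data_dir}"
assert msl == expected_msl, f"max_sequence_length is {msl}, expected {expected_msl}"
print(f"Config looks consistent: data_dir={data_dir}, max_sequence_length={msl}")
```

For example: `python check_config.py /path/to/yaml 128`.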
Phase 1 with MSL 128 and phase 2 with MSL 512: To continue the pre-training with MSL=512 from a checkpoint that resulted from the first phase of MSL=128, the parameter `model.max_position_embeddings` should be set to 512 for both phases of pre-training.
Phase 1 with MSL 128: If you would like to pre-train a model with MSL=128 and do not plan to use that model to continue pre-training with a longer sequence length, then you can change the `model.max_position_embeddings` parameter to 128.
Please follow the instructions on our quickstart in the Developer Docs.
If running on a CPU or GPU, activate the environment from the Python GPU Environment setup and simply run:
python run.py CPU --mode train --params /path/to/yaml --model_dir /path/to/model_dir
or
python run.py GPU --mode train --params /path/to/yaml --model_dir /path/to/model_dir
Note: Change the command to `--mode eval` for evaluation.
The MLM loss is scaled by the number of masked tokens in the current batch, so that it becomes an average over the masked tokens. For numerical reasons, this scaling is done in two steps:
- First, the `masked_lm_weights` tensor is scaled by `batch_size / num_masked_tokens_in_batch` in the input pipeline.
- Then, the final loss is scaled by `1 / batch_size`.
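The net effect is that each masked token contributes with weight `1 / num_masked_tokens_in_batch`, i.e., the MLM loss is the mean over the masked tokens in the batch. The short sketch below, with made-up per-token loss values, only demonstrates that the two-step scaling matches this direct mean; it is not the repo's loss code:

```python
# Demonstration that the two-step scaling equals a mean over masked tokens.
per_token_loss = [2.0, 0.5, 1.5, 3.0, 1.0]  # made-up losses for the masked tokens in a batch
batch_size = 4
num_masked_tokens_in_batch = len(per_token_loss)

# Step 1: weights applied in the input pipeline.
weight = batch_size / num_masked_tokens_in_batch
weighted_sum = sum(loss * weight for loss in per_token_loss)

# Step 2: final scaling of the summed loss.
two_step_loss = weighted_sum / batch_size

direct_mean = sum(per_token_loss) / num_masked_tokens_in_batch
assert abs(two_step_loss - direct_mean) < 1e-9
print(two_step_loss, direct_mean)  # both 1.6
```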
To train the model, you need to provide a YAML config file. Below is the list of YAML config files included for this model implementation in the configs folder. Feel free to create your own following these examples:
- `bert_base_*.yaml` have the standard BERT-base config with `hidden_size=768`, `num_hidden_layers=12`, `num_heads=12`.
- `bert_large_*.yaml` have the standard BERT-large config with `hidden_size=1024`, `num_hidden_layers=24`, `num_heads=16`.
- Files with the substring `MSL***`, like `bert_base_MSL128.yaml`, contain different maximum sequence lengths. Popular sequence lengths of `128`, `512`, and `1024` are provided in our example config files.
- Files like `roberta_*.yaml` are provided to run RoBERTa model variants (RoBERTa: A Robustly Optimized BERT Pretraining Approach).
Reference: BERT paper on arXiv.org: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Reference: RoBERTa paper on arXiv.org: RoBERTa: A Robustly Optimized BERT Pretraining Approach.