- Overview of the model
- Sequence of the steps to perform
- Structure of the code
- Before you start
- Download and prepare the dataset
- BERT input function
- How to run
- Appendix
Bidirectional Encoder Representations from Transformers (BERT) is an encoder-only transformer-based model designed for natural language understanding.
This directory contains implementations of the BERT model. It uses a stack of transformer blocks with multi-head attention followed by a multi-layer perceptron feed-forward network.
We support removing the next-sentence-prediction (NSP) loss from BERT training and pre-training with only the masked-language-modeling (MLM) loss.
The training pipeline has 2 phases. We first train with a maximum sequence length of 128 and then train with a maximum sequence length of 512. More details of the model can be found in the appendix.
An overview of the model diagram is here:
We also support the RoBERTa model, which is very similar to BERT in architectural design. To improve on BERT's results, RoBERTa changes the objective function (removing NSP), the batch sizes, the sequence lengths, and the masking pattern (dynamic instead of static). The difference between dynamic and static masking is discussed below.
This document walks you through an example showing the steps to run a BERT pre-training on the Cerebras Wafer-Scale Cluster (and on GPUs) using the code in this repo.
Note: You can use any subset of the steps described in this example. For example, if you already downloaded your preferred dataset and created the CSV files, then you can skip the section Preprocess data.
The following block diagram shows a high-level view of the sequence of steps you will perform in this example.
- `configs/`: YAML configuration files.
- `fine_tuning/`: Code for fine-tuning the BERT model.
- `data/nlp/bert`: Input pipeline implementation based on the OpenWebText dataset. This directory also contains the scripts you can use to download and prepare the OpenWebText dataset. Vocab files are located in `models/vocab`.
- `model.py`: Provides a common wrapper for all models under the class `BertForPreTrainingModel`, which interfaces with model-specific code. In this repo the model-specific code, i.e., the model architecture, is in `bert_pretrain_models.py::BertPretrainModel`, and the fine-tuning model architectures are in `bert_finetune_models.py`. This wrapper provides a common interface for handling the function call of the model with its specific data format. It also provides a common interface for constructing various models from the same format of configuration files in `configs/`.
- `data.py`: The entry point to the data input pipeline code.
- `run.py`: Training script. Performs training and validation.
- `utils.py`: Miscellaneous helper functions.
This example walk-through consists of two main steps:
- Prepare the dataset.
- Perform the pre-training.
This example follows the standard practice of 2-phase pre-training for BERT models. In the 2-phase pre-training, the model is:

- First pre-trained with the maximum sequence length (MSL) of 128 for 90% of the steps.
- Then pre-trained with the MSL of 512 for the final 10% of the steps.
CSV files for each phase: You will need to create different CSV files for each of these 2 phases of pre-training; details are provided below.
The scripts for downloading and preprocessing the OpenWebText dataset (https://skylion007.github.io/OpenWebTextCorpus/) are located in this repo.
Start by downloading the OWT dataset by accessing the following link from a browser:

https://drive.google.com/uc?id=1EA5V0oetDCOke7afsktL_JDQ-ETtNOvx

and manually download the `tar.xz` file from that location to your preferred local directory.
NOTE: Currently, a server-side issue with the OWT site prevents using the download_and_extract.sh shell script below to download this tar file. We will update the script when this issue is resolved.
In the original paper, RoBERTa is trained with OpenWebText and the following additional datasets:
NOTE: In order to replicate our results, please use the dataset provided by us. It is a combination of a few of the datasets above but not all. If you want to replicate the results of the original paper, please go to the links and download the required datasets from there.
To extract the above-downloaded files, run the download_and_extract.sh shell script:
bash download_and_extract.sh
NOTE: The download_and_extract.sh script may take a while to complete, as it unpacks 40GB of data (8,013,770 documents).
Upon completion, the script will produce an `openwebtext` folder in the same folder where the tar file is located. The `openwebtext` folder will have multiple subfolders, each containing a collection of `*.txt` files of raw text data (one document per `.txt` file).
NOTE: For the other datasets that are used with RoBERTa, you can download each dataset from its link and extract it to your preferred location with the corresponding extraction command.
In the next step, you will create two subsets of extracted txt files, one for training and the second for validation. These two subsets are then used to create CSV files that will be used for pre-training.
IMPORTANT: The training and validation subsets must contain mutually exclusive .txt files.
Proceed as follows:
Define metadata files that contain paths to subsets of documents in the `openwebtext` folder to be used for training or validation.

For training, in this tutorial we use a subset of 512,000 documents. The associated metadata file can be found in metadata/train_512k.txt.

For validation, we choose 5,000 documents that are not in the training set. The metadata file for validation can be found in metadata/val_files.txt.
NOTE: You are free to create your own metadata files that define your train and validation data subsets, with your preferred content and sizes. You can also create a data subset for testing.
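For reference, here is a minimal sketch of how you could generate your own metadata files by listing the extracted `.txt` files and splitting them into disjoint training and validation subsets. The script name, paths, split size, and the assumption that a metadata file simply lists one document path per line are illustrative, not part of this repo:

```python
# make_metadata.py -- illustrative sketch, not part of the repo.
# Lists the extracted .txt files under the openwebtext folder and writes
# disjoint train/validation metadata files (one relative path per line).
import random
from pathlib import Path

OWT_DIR = Path("/path/to/raw/data/openwebtext")  # assumption: adjust to your location
NUM_VAL = 5000                                   # example validation subset size

txt_files = sorted(p.relative_to(OWT_DIR) for p in OWT_DIR.rglob("*.txt"))
random.seed(0)
random.shuffle(txt_files)

val_files, train_files = txt_files[:NUM_VAL], txt_files[NUM_VAL:]

Path("my_val_files.txt").write_text("\n".join(map(str, val_files)) + "\n")
Path("my_train_files.txt").write_text("\n".join(map(str, train_files)) + "\n")
```

The resulting files can then be passed to `create_csv.py` via `--metadata_files`, with `--input_files_prefix` pointing at the `openwebtext` folder, since the paths written above are relative.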
Next, using the metadata file that defines a data subset (for training or for validation), create CSV files containing masked sequences and labels derived from the data subset.
The preprocessing consists of creating CSV files containing sequences and labels.
Prerequisites
If you do not have spaCy, the natural language processing (NLP) library, then install it with the following commands:
pip install spacy
python -m spacy download en
create_csv.py
To create CSV files containing sequences and labels derived from the data subset, you will use the Python utility create_csv.py
located in the data_preparation/nlp/bert directory.
create_csv_mlm_only.py
In addition, the create_csv_mlm_only.py script can be used to create data without the NSP labels.
Note that create_csv.py and create_csv_mlm_only.py are intended to work with dynamic masking, so they do not create masks during preprocessing. Dynamic masks are created on the fly in the `BertCSVDynamicMaskDataProcessor`.
create_csv_static_masking.py
The create_csv_static_masking.py script can be used to create a dataset with static masking, which performs masking only once, during data preprocessing, so the same input masks are applied on each epoch.
create_csv_mlm_only_static_masking.py
The create_csv_mlm_only_static_masking.py script can be used to create statically masked data without NSP labels.
Note that:

- `create_csv.py` or `create_csv_mlm_only.py` creates CSV files to be used with `BertCSVDynamicMaskDataProcessor`.
- `create_csv_static_masking.py` or `create_csv_mlm_only_static_masking.py` creates CSV files to be used with `BertCSVDataProcessor`.
- `BertCSVDataProcessor` loads the masks created during data preprocessing, so the same input masks are applied on each epoch.
- `BertCSVDynamicMaskDataProcessor` creates the masks on the fly every time the data is loaded, so the input masks are different on each epoch.
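To make the difference concrete, here is a minimal, self-contained sketch of an MLM masking step (not the repo's implementation; the helper name, special-token ID, and probabilities are illustrative). With static masking this function runs once while the CSV files are written; with dynamic masking a processor such as `BertCSVDynamicMaskDataProcessor` re-applies it every time a sample is loaded, so each epoch sees different masks:

```python
# Illustrative MLM masking sketch; not the repo's implementation.
import random

def apply_mlm_mask(token_ids, vocab_size, mask_id=103, mlm_prob=0.15, rng=None):
    """Return (masked_ids, mlm_positions, mlm_labels) using the standard
    BERT 80/10/10 rule for tokens selected with probability mlm_prob."""
    rng = rng or random.Random()
    masked_ids, positions, labels = list(token_ids), [], []
    for pos, tok in enumerate(token_ids):
        if rng.random() >= mlm_prob:
            continue  # token not selected for prediction
        positions.append(pos)
        labels.append(tok)
        r = rng.random()
        if r < 0.8:                    # 80%: replace with the [MASK] token
            masked_ids[pos] = mask_id
        elif r < 0.9:                  # 10%: replace with a random token
            masked_ids[pos] = rng.randrange(vocab_size)
        # remaining 10%: keep the original token
    return masked_ids, positions, labels

tokens = [2023, 2003, 1037, 7099, 6251, 2005, 23325, 2075]

# "Static" masking: mask once during preprocessing, reuse every epoch.
static_sample = apply_mlm_mask(tokens, vocab_size=30522, rng=random.Random(0))

# "Dynamic" masking: re-mask at load time, so each epoch differs.
for epoch in range(3):
    print(epoch, apply_mlm_mask(tokens, vocab_size=30522))
```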
Refer to README.md for more details.
Syntax
The command-line syntax to run the Python utility `create_csv.py` is as follows:
python create_csv.py --metadata_files /path/to/metadata_file.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file /path/to/vocab.txt --do_lower_case
where:
- `metadata_file.txt` is a metadata file containing a list of paths to documents, and
- `/path/to/vocab.txt` contains a vocabulary file to map WordPieces to word IDs.
For example, you can use the supplied `metadata/train_512k.txt` as an input to generate a training set based on 512,000 documents. Sample vocabularies can be found in the `models/vocab` folder.
Some arguments and their usage are listed below:
- `--metadata_files`: path to a text file containing a list of file names corresponding to the raw input documents to be processed and stored; can handle multiple metadata files separated by spaces (default: `None`).
- `--input_files_prefix`: prefix to be added to the paths of the input files. For example, it can be a directory where raw data is stored if the paths are relative.
- `--vocab_file`: path to the vocabulary file (default: `None`).
- `--split_num`: number of input files to read at a given time for processing (default: `1000`).
- `--do_lower_case`: pass this flag to lower-case the input text; should be `True` for uncased models and `False` for cased models (default: `False`).
- `--max_seq_length`: maximum sequence length (default: `128`).
- `--masked_lm_prob`: masked LM probability (default: `0.15`).
- `--max_predictions_per_seq`: maximum number of masked LM predictions per sequence (default: `20`).
- `--output_dir`: directory where CSV files will be stored (default: `./csvfiles/`).
- `--seed`: random seed (default: `0`).
For more details, run the command `python create_csv.py --help` (or `python create_csv_static_masking.py --help` if creating statically masked data).
For the 2-phase BERT pre-training that we are following in this tutorial, you need to generate the following datasets:

- A training dataset for the first phase, using sequences with a maximum length of 128.
- A second training dataset for the second phase, using sequences with a maximum length of 512.
If you want to run validation, then:
- Two additional validation datasets, one for each phase.
In total, to run training and validation for both pre-training phases, you will need four datasets: a training and a validation dataset for phase 1 with MSL 128, and a training and a validation dataset for phase 2 with MSL 512.
Proceed by running the following commands:
Generate training CSV files for phase 1 (MSL 128):
python create_csv.py --metadata_files metadata/train_512k.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir train_512k_uncased_msl128 --do_lower_case --max_seq_length 128 --max_predictions_per_seq 20
Generate validation CSV files for phase 1 (MSL 128):
python create_csv.py --metadata_files metadata/val_files.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir val_uncased_msl128 --do_lower_case --max_seq_length 128 --max_predictions_per_seq 20
Generate training CSV files for phase 2 (MSL 512):
python create_csv.py --metadata_files metadata/train_512k.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir train_512k_uncased_msl512 --do_lower_case --max_seq_length 512 --max_predictions_per_seq 80
Generate validation CSV files for phase 2 (MSL 512):
python create_csv.py --metadata_files metadata/val_files.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir val_uncased_msl512 --do_lower_case --max_seq_length 512 --max_predictions_per_seq 80
The above-created CSV files are then used by the `BertCSVDynamicMaskDataProcessor` class to produce inputs to the model.
If you want to use static masking instead, generate the CSV files with `create_csv_static_masking.py` and specify `BertCSVDataProcessor` as the `data_processor` under `train_input` in the config file.
If you want to use your own data loader with this example code, then this section describes the input data format expected by the `BertForPreTrainingModel` class defined in model.py.
When you create your own custom BERT input function, you must ensure that your BERT input function produces a features dictionary as described in this section.
The features dictionary has the following key/values:
- `input_ids`: Input token IDs, padded with `0` to `max_sequence_length`.
  - Shape: [`batch_size`, `max_sequence_length`]
  - Type: `torch.int32`
- `attention_mask`: Mask for padded positions. Has values `0` on the padded positions and `1` elsewhere.
  - Shape: [`batch_size`, `max_sequence_length`]
  - Type: `torch.int32`
- `token_type_ids`: Segment IDs. Has values `0` on the positions corresponding to the first segment and `1` on the positions corresponding to the second segment.
  - Shape: [`batch_size`, `max_sequence_length`]
  - Type: `torch.int32`
- `masked_lm_positions`: Positions of masked tokens in the `input_ids` tensor, padded with `0` to `max_predictions_per_seq`.
  - Shape: [`batch_size`, `max_predictions_per_seq`]
  - Type: `torch.int32`
- `masked_lm_weights`: Mask for `masked_lm_positions`. Has value `batch_size / num_masked_tokens_in_batch` on the positions corresponding to actually masked tokens in the given sample, and `0.0` elsewhere. See the MLM Loss Scaling section for more detail.
  - Shape: [`batch_size`, `max_predictions_per_seq`]
  - Type: `torch.float32`
- `labels`: Labels for computing the masked language modeling loss. Tokens with indices set to `-100` are ignored (masked out of the loss).
  - Shape: [`batch_size`, `max_sequence_length`]
  - Type: `torch.int32`
- `next_sentence_label`: Labels for computing the next sentence prediction (classification) loss. The input should be a sequence pair, where `0` indicates that sequence B is a continuation of sequence A and `1` indicates that sequence B is a random sequence.
  - Shape: [`batch_size`,]
  - Type: `torch.int32`
Note: You can omit the features `token_type_ids` and `next_sentence_label` if you are pre-training for MLM only (in the YAML file, set `disable_nsp` to `True` under the model params).
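As a concrete reference for the format above, below is a minimal sketch of an input function that yields a features dictionary with the expected keys, shapes, and dtypes, filled with random data. The function name and sizes are illustrative assumptions; a real data loader would derive these tensors from tokenized, paired, and masked text:

```python
# Illustrative sketch of a custom BERT input function (random data only).
import torch

def random_bert_pretraining_batch(
    batch_size=4, max_sequence_length=128, max_predictions_per_seq=20, vocab_size=30522
):
    # Pretend every prediction slot in the batch is a real masked token.
    num_masked = batch_size * max_predictions_per_seq
    features = {
        "input_ids": torch.randint(
            0, vocab_size, (batch_size, max_sequence_length), dtype=torch.int32
        ),
        # 1 for real tokens, 0 for padding (all real here).
        "attention_mask": torch.ones(batch_size, max_sequence_length, dtype=torch.int32),
        # 0 for segment A, 1 for segment B.
        "token_type_ids": torch.zeros(batch_size, max_sequence_length, dtype=torch.int32),
        # Positions of the masked tokens, padded with 0.
        "masked_lm_positions": torch.randint(
            0, max_sequence_length, (batch_size, max_predictions_per_seq), dtype=torch.int32
        ),
        # batch_size / num_masked_tokens_in_batch on real masks, 0.0 on padding.
        "masked_lm_weights": torch.full(
            (batch_size, max_predictions_per_seq), batch_size / num_masked, dtype=torch.float32
        ),
        # -100 everywhere a position should not contribute to the MLM loss;
        # a real loader would place the original token IDs at the masked positions.
        "labels": torch.full((batch_size, max_sequence_length), -100, dtype=torch.int32),
        # 0: sentence B follows A, 1: sentence B is random.
        "next_sentence_label": torch.randint(0, 2, (batch_size,), dtype=torch.int32),
    }
    return features

batch = random_bert_pretraining_batch()
for name, tensor in batch.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```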
Parameter settings in YAML config file: The config YAML files are located in the configs directory. Before starting a pre-training run, make sure that in the YAML config file you are using:
- The `train_input.data_dir` parameter points to the correct dataset, and
- The `train_input.max_sequence_length` parameter corresponds to the sequence length `--max_seq_length` passed to `create_csv.py` or `create_csv_mlm_only.py`.
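As an optional convenience, a short script like the sketch below can sanity-check these two settings before launching a run. It assumes the config uses the `train_input.data_dir` and `train_input.max_sequence_length` keys mentioned above and that PyYAML is installed; the script name is hypothetical:

```python
# check_config.py -- illustrative sanity check, not part of the repo.
import sys
from pathlib import Path

import yaml  # PyYAML

config_path, expected_msl = sys.argv[1], int(sys.argv[2])
params = yaml.safe_load(Path(config_path).read_text())

train_input = params["train_input"]
data_dir = Path(train_input["data_dir"])
msl = train_input["max_sequence_length"]

assert data_dir.is_dir(), f"data_dir does not exist: {data_dir}"
assert msl == expected_msl, f"max_sequence_length is {msl}, expected {expected_msl}"
print(f"Config looks consistent: data_dir={data_dir}, max_sequence_length={msl}")
```

For example: `python check_config.py /path/to/yaml 128`.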
Phase 1 with MSL 128 and phase 2 with MSL 512: To continue the pre-training with MSL=512 from a checkpoint that resulted from the first phase of MSL=128, the parameter `model.max_position_embeddings` should be set to 512 for both phases of pre-training.
Phase 1 with MSL 128: If you would like to pre-train a model with MSL=128 and do not plan to use that model to continue pre-training with a longer sequence length, then you can change the `model.max_position_embeddings` parameter to 128.
Please follow the instructions on our quickstart in the Developer Docs.
If running on a CPU or GPU, activate the environment from the Python GPU Environment setup and simply run:
python run.py CPU --mode train --params /path/to/yaml --model_dir /path/to/model_dir
or
python run.py GPU --mode train --params /path/to/yaml --model_dir /path/to/model_dir
Note: Change the command to `--mode eval` for evaluation.
The MLM loss is scaled by the number of masked tokens in the current batch, so that it becomes an average over the masked tokens. For numerical reasons, this scaling is done in two steps:
- First, the `masked_lm_weights` tensor is scaled by `batch_size / num_masked_tokens_in_batch` in the input pipeline.
- Then, the final loss is scaled by `1 / batch_size`.
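The net effect is that each masked token contributes with weight `1 / num_masked_tokens_in_batch`, i.e., the MLM loss is the mean over the masked tokens in the batch. The short sketch below, with made-up per-token loss values, only demonstrates that the two-step scaling matches this direct mean; it is not the repo's loss code:

```python
# Demonstration that the two-step scaling equals a mean over masked tokens.
per_token_loss = [2.0, 0.5, 1.5, 3.0, 1.0]  # made-up losses for the masked tokens in a batch
batch_size = 4
num_masked_tokens_in_batch = len(per_token_loss)

# Step 1: weights applied in the input pipeline.
weight = batch_size / num_masked_tokens_in_batch
weighted_sum = sum(loss * weight for loss in per_token_loss)

# Step 2: final scaling of the summed loss.
two_step_loss = weighted_sum / batch_size

direct_mean = sum(per_token_loss) / num_masked_tokens_in_batch
assert abs(two_step_loss - direct_mean) < 1e-9
print(two_step_loss, direct_mean)  # both 1.6
```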
To train the model, you need to provide a YAML config file. Below is the list of YAML config files included for this model implementation in the configs folder. Feel free to create your own following these examples:
- `bert_base_*.yaml` have the standard BERT-base config with `hidden_size=768`, `num_hidden_layers=12`, `num_heads=12`.
- `bert_large_*.yaml` have the standard BERT-large config with `hidden_size=1024`, `num_hidden_layers=24`, `num_heads=16`.
- Files with the substring `MSL***`, like `bert_base_MSL128.yaml`, contain different maximum sequence lengths. Popular sequence lengths of `128`, `512`, and `1024` are provided in our example config files.
- Files like `roberta_*.yaml` are provided to run RoBERTa model variants (RoBERTa: A Robustly Optimized BERT Pretraining Approach).
Reference: BERT paper on arXiv.org: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Reference: RoBERTa paper on arXiv.org: RoBERTa: A Robustly Optimized BERT Pretraining Approach.