Bidirectional Transformers for Language Understanding

Overview of the model

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only, transformer-based model designed for natural language understanding. This directory contains implementations of the BERT model. The model uses a stack of transformer blocks, each consisting of multi-head attention followed by a multi-layer perceptron feed-forward network. We support removing the next-sentence-prediction (NSP) loss from BERT training, so that the model is trained with only the masked-language-modeling (MLM) loss. The training pipeline has two phases: we first train with a maximum sequence length of 128 and then with a maximum sequence length of 512. More details of the model can be found in the appendix.

[Figure: overview diagram of the model]

We also support the RoBERTa model, which is very similar to BERT in architectural design. To improve on BERT's results, RoBERTa changes the objective function (removing NSP), batch sizes, sequence lengths, and masking patterns (dynamic vs. static). The difference between dynamic and static masking is discussed here.

Sequence of the steps to perform

This document walks you through an example showing the steps to run BERT pre-training on the Cerebras Wafer-Scale Cluster (and on GPUs) using the code in this repo.

Note: You can use any subset of the steps described in this example. For example, if you already downloaded your preferred dataset and created the CSV files, then you can skip the section Preprocess data.

The following block diagram shows a high-level view of the sequence of steps you will perform in this example.

[Figure: Running BERT on the Cerebras System]

Structure of the code

  • configs/: YAML configuration files.
  • fine_tuning/: Code for fine-tuning the BERT model.
  • data/nlp/bert: Input pipeline implementation based on the Open Web Text dataset. This directory also contains the scripts you can use to download and prepare the Open Web Text dataset. Vocab files are located in models/vocab.
  • Provides a common wrapper for all models under the class BertForPreTrainingModel, which interfaces with the model-specific code (the model architecture and the fine-tuning model architectures). This wrapper provides a common interface for handling the function call of the model with its specific data format. It also provides a common interface for constructing various models from the same format of configuration files in configs/.
  • The entry point to the data input pipeline code.
  • Training script. Performs training and validation.
  • Miscellaneous helper functions.

Before you start

This example walk-through consists of two main steps:

  1. Prepare the dataset.
  2. Perform the pre-training.

This example follows the standard practice of 2-phase pre-training for BERT models. In the 2-phase pre-training, the model is:

  • First pre-trained with a maximum sequence length (MSL) of 128 for 90% of the steps.
  • Then pre-trained with an MSL of 512 for the final 10% of the steps.
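
As a rough illustration of the 90/10 split, the sketch below divides a total step budget between the two phases. The function name `split_phase_steps` and the 10,000-step budget are illustrative assumptions, not values taken from this repo or its configs.

```python
def split_phase_steps(total_steps, phase1_fraction=0.9):
    """Split a step budget between phase 1 (MSL 128) and phase 2 (MSL 512)."""
    if not 0.0 < phase1_fraction < 1.0:
        raise ValueError("phase1_fraction must be strictly between 0 and 1")
    phase1_steps = round(total_steps * phase1_fraction)
    # Whatever is left after phase 1 goes to the longer-sequence phase.
    phase2_steps = total_steps - phase1_steps
    return {"msl_128": phase1_steps, "msl_512": phase2_steps}


# Hypothetical budget of 10,000 steps.
steps = split_phase_steps(10_000)
print(steps)  # {'msl_128': 9000, 'msl_512': 1000}
```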

CSV files for each phase: You will need to create different CSV files for each of these 2 phases of pre-training, details here.

Download and prepare the dataset


OpenWebText dataset

The scripts for downloading and preprocessing the OpenWebText (OWT) dataset are located here.

Start by downloading the OWT dataset by accessing the following link from a browser:

and manually download the tar.xz file from that location to your preferred local directory.

NOTE: Currently, a server-side issue with the OWT site prevents using the shell script below to download this tar file. We will update the script when this issue is resolved.

Other datasets and download links

RoBERTa is trained with OpenWebText and the following datasets in the original paper:

  1. Book corpus
  2. English Wikipedia
  3. CC-News
  4. Stories

NOTE: In order to replicate our results, please use the dataset provided by us. It is a combination of a few of the datasets above but not all. If you want to replicate the results of the original paper, please go to the links and download the required datasets from there.


To extract the above-downloaded files, run shell script:


NOTE: The extraction may take a while to complete, as it unpacks 40GB of data (8,013,770 documents).

Upon completion, the script will produce an openwebtext folder in the same folder where the tar file is located. The openwebtext folder will have multiple subfolders, each containing a collection of *.txt files of raw text data (one document per .txt file).

NOTE: For other datasets that are used with RoBERTa, you can download the dataset from the links and extract to your preferred location with the corresponding extraction commands.

Allocate subsets for training and validation

In the next step, you will create two subsets of extracted txt files, one for training and the second for validation. These two subsets are then used to create CSV files that will be used for pre-training.

IMPORTANT: The training and validation subsets must contain mutually exclusive .txt files.

Proceed as follows:

Define metadata files that contain paths to subsets of documents in the openwebtext folder to be used for training or validation.

For training, in this tutorial we use a subset of 512,000 documents. The associated metadata file can be found in metadata/train_512k.txt.

For validation, we choose 5,000 documents that are not in the training set. The metadata file for validation can be found in metadata/val_files.txt.

NOTE: You are free to create your own metadata files that define your train and validation data subsets, with your preferred content and sizes. You can also create a data subset for the test.
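
If you build your own metadata files, one possible approach is sketched below: shuffle the document paths with a fixed seed and cut the list in two, which keeps the train and validation subsets mutually exclusive by construction. The helper `split_metadata` and the example paths are hypothetical; writing the resulting lists to metadata files (one path per line is assumed here) is left out.

```python
import random


def split_metadata(paths, num_val, seed=0):
    """Shuffle document paths and split them into disjoint train/val lists."""
    if num_val > len(paths):
        raise ValueError("num_val exceeds the number of available documents")
    shuffled = sorted(paths)  # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)
    val, train = shuffled[:num_val], shuffled[num_val:]
    # Guard against duplicate input paths leaking across the two subsets.
    if set(train) & set(val):
        raise ValueError("train and validation subsets overlap")
    return train, val


# Hypothetical relative paths inside the openwebtext folder.
docs = [f"subset_{i // 5}/doc_{i}.txt" for i in range(20)]
train_files, val_files = split_metadata(docs, num_val=4)
print(len(train_files), len(val_files))  # 16 4
```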

Next, using the metadata file that defines a data subset (for training or for validation), create CSV files containing masked sequences and labels derived from the data subset.

Preprocess data

Preprocessing consists of creating CSV files containing sequences and labels.


If you do not have spaCy, the natural language processing (NLP) library, then install it with the following commands:

pip install spacy
python -m spacy download en

Relevant files

To create CSV files containing sequences and labels derived from the data subset, you will use the Python utility located in the data_preparation/nlp/bert directory.

In addition, script can be used to create data without the NSP labels.

Note that the scripts intended for dynamic masking do not create masks during preprocessing. Dynamic masking is created on the fly in the BertCSVDynamicMaskDataProcessor. The static-masking script creates a dataset in which masking is performed only once, during data preprocessing, so the same input masks are applied on each epoch. A separate script creates static masking without NSP labels.

Note that:

  • The dynamic-masking scripts make CSV files to be used with BertCSVDynamicMaskDataProcessor.
  • The static-masking scripts make CSV files to be used with BertCSVDataProcessor.
  • BertCSVDataProcessor loads the masking created during data preprocessing, so on each epoch the same input masks are applied.
  • BertCSVDynamicMaskDataProcessor creates the masking on the fly every time the data is loaded, so the input masks of sentences are different on each epoch.
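
The contrast between the two masking modes can be sketched in a few lines. This is a simplified illustration, not the data processors' implementation: it only replaces tokens with a mask id, ignores special tokens, and omits the usual random/keep replacement variants. `MASK_ID`, `mask_tokens`, and the token ids are assumptions made for the example.

```python
import random

MASK_ID = 103  # assumed [MASK] id; the real value depends on the vocab file


def mask_tokens(token_ids, rng, masked_lm_prob=0.15, max_predictions=20):
    """Replace a random subset of tokens with MASK_ID and return the positions."""
    num_to_mask = min(max_predictions, max(1, round(len(token_ids) * masked_lm_prob)))
    positions = sorted(rng.sample(range(len(token_ids)), num_to_mask))
    masked = list(token_ids)
    for pos in positions:
        masked[pos] = MASK_ID
    return masked, positions


tokens = list(range(1000, 1020))  # 20 made-up token ids

# Static masking: mask once during preprocessing, reuse on every epoch.
static_masked, static_positions = mask_tokens(tokens, random.Random(0))
static_epochs = [static_positions for _ in range(3)]

# Dynamic masking: draw a fresh mask each time the sample is loaded.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng)[1] for _ in range(3)]

print("static :", static_epochs)
print("dynamic:", dynamic_epochs)
```

With static masking, every epoch sees the same masked positions; with dynamic masking, the positions are redrawn on each pass, so they generally differ between epochs.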

Refer to for more details.


The command-line syntax to run the Python utility is as follows:

python --metadata_files /path/to/metadata_file.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file /path/to/vocab.txt --do_lower_case


  • metadata_file.txt is a metadata file containing a list of paths to documents, and
  • /path/to/vocab.txt contains a vocabulary file to map WordPieces to word IDs.

For example, you can use the supplied metadata/train_512k.txt as an input to generate a training set based on 512,000 documents. Sample vocabularies can be found in the models/vocab folder.

Some arguments and their usage are listed below:

  • --metadata_files: path to text file containing a list of file names corresponding to the raw input documents to be processed and stored; can handle multiple metadata files separated by space (default: None).

  • --input_files_prefix: prefix to be added to paths of the input files. For example, can be a directory where raw data is stored if the paths are relative.

  • --vocab_file: path to vocabulary (default: None).

  • --split_num: number of input files to read at a given time for processing (default: 1000).

  • --do_lower_case: pass this flag to lower case the input text; should be True for uncased models and False for cased models (default: False).

  • --max_seq_length: maximum sequence length (default: 128).

  • --masked_lm_prob: masked LM probability (default: 0.15).

  • --max_predictions_per_seq: maximum number of masked LM predictions per sequence (default: 20).

  • --output_dir: directory where CSV files will be stored (default: ./csvfiles/).

  • --seed: random seed (default: 0).

For more details, run the command: python --help (or python --help if creating statically masked data).
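
For readers writing their own preprocessing script, the argument list above could be mirrored with `argparse` roughly as follows. This is a hypothetical parser built only from the names and defaults listed in this section, not the utility's actual source; the empty-string default for `--input_files_prefix` is an assumption, since no default is documented.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the arguments and defaults listed above."""
    parser = argparse.ArgumentParser(description="Create BERT pre-training CSV files")
    # One or more metadata files, separated by spaces.
    parser.add_argument("--metadata_files", nargs="+", default=None)
    # Assumed default: no documented default exists for this flag.
    parser.add_argument("--input_files_prefix", default="")
    parser.add_argument("--vocab_file", default=None)
    parser.add_argument("--split_num", type=int, default=1000)
    # Passing the flag enables lower-casing (uncased models).
    parser.add_argument("--do_lower_case", action="store_true")
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--masked_lm_prob", type=float, default=0.15)
    parser.add_argument("--max_predictions_per_seq", type=int, default=20)
    parser.add_argument("--output_dir", default="./csvfiles/")
    parser.add_argument("--seed", type=int, default=0)
    return parser


args = build_parser().parse_args(
    ["--metadata_files", "metadata/train_512k.txt", "--do_lower_case"]
)
print(args.max_seq_length, args.do_lower_case)  # 128 True
```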

Create CSVs for 2-phase pre-training

For the 2-phase BERT pre-training that we are following in this tutorial, you need to generate the following datasets:

  • A training dataset for the first phase using sequences with maximum length of 128.
  • A second training dataset for the second phase using sequences with maximum sequence length of 512.

If you want to run validation, then:

  • Two additional validation datasets, one for each phase.

In total, to run training and validation for both the pre-training phases, you will need four datasets: a training and a validation dataset for phase 1 with MSL 128, and a training and a validation dataset for phase 2 with MSL 512.

Run the following commands:

Phase 1: MSL 128

Generate training CSV files:

python --metadata_files metadata/train_512k.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir train_512k_uncased_msl128 --do_lower_case --max_seq_length 128 --max_predictions_per_seq 20

Generate validation CSV files:

python --metadata_files metadata/val_files.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir val_uncased_msl128 --do_lower_case --max_seq_length 128 --max_predictions_per_seq 20

Phase 2: MSL 512

Generate training CSV files:

python --metadata_files metadata/train_512k.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir train_512k_uncased_msl512 --do_lower_case --max_seq_length 512 --max_predictions_per_seq 80

Generate validation CSV files:

python --metadata_files metadata/val_files.txt --input_files_prefix /path/to/raw/data/openwebtext --vocab_file ../../vocab/google_research_uncased_L-12_H-768_A-12.txt --output_dir val_uncased_msl512 --do_lower_case --max_seq_length 512 --max_predictions_per_seq 80

The above-created CSV files are then used by the BertCSVDynamicMaskDataProcessor class to produce inputs to the model.
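
The four commands differ only in the metadata file, the output directory, and the two length-related flags. The sketch below generates the four flag lists from that pattern, which may help when scripting the runs. The helper `preprocessing_flags` is hypothetical, and the name of the preprocessing script itself is deliberately left out.

```python
VOCAB = "../../vocab/google_research_uncased_L-12_H-768_A-12.txt"
PHASES = {128: 20, 512: 80}  # max_seq_length -> max_predictions_per_seq
SPLITS = {"train_512k": "metadata/train_512k.txt", "val": "metadata/val_files.txt"}


def preprocessing_flags(split, msl, raw_dir="/path/to/raw/data/openwebtext"):
    """Return the flag list for one (split, phase) combination."""
    return [
        "--metadata_files", SPLITS[split],
        "--input_files_prefix", raw_dir,
        "--vocab_file", VOCAB,
        "--output_dir", f"{split}_uncased_msl{msl}",
        "--do_lower_case",
        "--max_seq_length", str(msl),
        "--max_predictions_per_seq", str(PHASES[msl]),
    ]


# Two phases times two splits gives the four datasets described above.
all_runs = {(s, m): preprocessing_flags(s, m) for m in PHASES for s in SPLITS}
for (split, msl), flags in all_runs.items():
    print(split, msl, " ".join(flags))
```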

If you want to use static masking instead, generate the csv files with and specify BertCSVDataProcessor as the data_processor under train_input in the config file.

BERT input function

If you want to use your own data loader with this example code, then this section describes the input data format expected by BertForPreTrainingModel class defined in

When you create your own custom BERT input function, you must ensure that your BERT input function produces a features dictionary as described in this section.

BERT features dictionary

The features dictionary has the following key/values:

  • input_ids: Input token IDs, padded with 0 to max_sequence_length.

    • Shape: [batch_size, max_sequence_length].
    • Type: torch.int32
  • attention_mask: Mask for padded positions. Has values 0 on the padded positions and 1 elsewhere.

    • Shape: [batch_size, max_sequence_length]
    • Type: torch.int32
  • token_type_ids: Segment IDs. Has values 0 on the positions corresponding to the first segment, and 1 on positions corresponding to the second segment.

    • Shape: [batch_size, max_sequence_length]
    • Type: torch.int32
  • masked_lm_positions: Positions of masked tokens in the input_ids tensor, padded with 0 to max_predictions_per_seq.

    • Shape: [batch_size, max_predictions_per_seq]
    • Type: torch.int32
  • masked_lm_weights: Mask for masked_lm_positions. Has values batch_size / num_masked_tokens_in_batch on the positions corresponding to actually masked tokens in the given sample, and 0.0 elsewhere. See the MLM loss scaling section for more detail.

    • Shape: [batch_size, max_predictions_per_seq]
    • Type: torch.float32
  • labels: Labels for computing the masked language modeling loss. Tokens with indices set to -100 are masked.

    • Shape: [batch_size, max_sequence_length]
    • Type: torch.int32
  • next_sentence_label: Labels for computing the next sentence prediction (classification) loss. The input should be a sequence pair, where 0 indicates that sequence B is a continuation of sequence A and 1 indicates that sequence B is a random sequence.

    • Shape: [batch_size]
    • Type: torch.int32

Note: You can omit the token_type_ids and next_sentence_label features if you are pre-training with MLM only (in the YAML file, set disable_nsp to True under the model params).
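
To make the padding rules concrete, here is a minimal per-sample sketch using plain Python lists. A real input function would stack such samples into torch tensors of the shapes and types listed above. The helper `build_features` and the token ids are made up; the sketch assumes that -100 marks label positions that do not contribute to the loss and that padded token_type_ids are 0. It omits masked_lm_weights (which depends on the whole batch) and next_sentence_label.

```python
def build_features(token_ids, segment_ids, masked_positions, masked_labels,
                   max_sequence_length=128, max_predictions_per_seq=20):
    """Pad one tokenized sample into the per-sample fields of the features dict."""
    num_tokens, num_masked = len(token_ids), len(masked_positions)
    pad = max_sequence_length - num_tokens
    pred_pad = max_predictions_per_seq - num_masked
    if pad < 0 or pred_pad < 0:
        raise ValueError("sample exceeds the configured maximum lengths")
    # Assumption: -100 everywhere except at the masked positions.
    labels = [-100] * max_sequence_length
    for pos, label in zip(masked_positions, masked_labels):
        labels[pos] = label
    return {
        "input_ids": token_ids + [0] * pad,
        "attention_mask": [1] * num_tokens + [0] * pad,
        "token_type_ids": segment_ids + [0] * pad,
        "masked_lm_positions": masked_positions + [0] * pred_pad,
        "labels": labels,
    }


# Made-up token ids; position 2 is the masked token.
sample = build_features(
    token_ids=[101, 2023, 103, 2003, 102],
    segment_ids=[0, 0, 0, 0, 0],
    masked_positions=[2],
    masked_labels=[3185],
    max_sequence_length=8,
    max_predictions_per_seq=2,
)
print(sample["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```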

How to run

Parameter settings in YAML config file: The config YAML files are located in the configs directory. Before starting a pre-training run, make sure that in the YAML config file you are using:

  • The train_input.data_dir parameter points to the correct dataset, and
  • The train_input.max_sequence_length parameter matches the sequence length passed as --max_seq_length during preprocessing.

Phase-1 with MSL128 and phase-2 with MSL512: To continue the pre-training with MSL=512 from a checkpoint that resulted from the first phase of MSL=128, the parameter model.max_position_embeddings should be set to 512 for both the phases of pre-training.

Phase-1 with MSL128: If you would like to pre-train a model with MSL=128 and do not plan to use that model to continue pre-training with a longer sequence length, then you can change this model.max_position_embeddings parameter to 128.
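
The constraint above can be checked before launching a run. The sketch below takes the per-phase configs as already-loaded dictionaries (loading the YAML files is left out) and verifies that model.max_position_embeddings is identical across phases and large enough for each phase's train_input.max_sequence_length. The function name `check_position_embeddings` is a hypothetical helper, not part of this repo.

```python
def check_position_embeddings(phase_configs):
    """Check that one checkpoint can be carried across all phases."""
    sizes = {cfg["model"]["max_position_embeddings"] for cfg in phase_configs}
    if len(sizes) != 1:
        raise ValueError(
            f"max_position_embeddings differs across phases: {sorted(sizes)}"
        )
    max_positions = sizes.pop()
    for cfg in phase_configs:
        msl = cfg["train_input"]["max_sequence_length"]
        if msl > max_positions:
            raise ValueError(f"max_sequence_length {msl} exceeds {max_positions}")
    return max_positions


phase1 = {"model": {"max_position_embeddings": 512},
          "train_input": {"max_sequence_length": 128}}
phase2 = {"model": {"max_position_embeddings": 512},
          "train_input": {"max_sequence_length": 512}}
print(check_position_embeddings([phase1, phase2]))  # 512
```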

To compile/validate, run train and eval on Cerebras System

Please follow the instructions on our quickstart in the Developer Docs.

To run train and eval on GPU/CPU

If running on a CPU or GPU, activate the environment from Python GPU Environment setup, and run:

python CPU --mode train --params /path/to/yaml --model_dir /path/to/model_dir


python GPU --mode train --params /path/to/yaml --model_dir /path/to/model_dir

Note: Change the command to --mode eval for evaluation.

MLM loss scaling

The MLM Loss is scaled by the number of masked tokens in the current batch. For numerical reasons this scaling is done in two steps:

  • First, the masked_lm_weights tensor is scaled by batch_size / num_masked_tokens_in_batch in the input pipeline.
  • Then, the final loss is scaled by 1 / batch_size.
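
Taken together, the two steps amount to averaging the per-token loss over the masked tokens in the batch: (1 / batch_size) * sum((batch_size / num_masked) * loss_i) = (1 / num_masked) * sum(loss_i). The sketch below works through this with plain Python lists. The function names and the loss values are made up for illustration, and this is not the repo's loss implementation.

```python
def mlm_weights(real_mask_flags):
    """Input pipeline step: scale each real masked slot by batch_size / num_masked."""
    batch_size = len(real_mask_flags)
    num_masked = sum(sum(row) for row in real_mask_flags)
    scale = batch_size / num_masked
    return [[scale * flag for flag in row] for row in real_mask_flags]


def scaled_mlm_loss(per_token_losses, weights):
    """Model step: weighted sum of per-token losses, then scale by 1 / batch_size."""
    batch_size = len(weights)
    total = sum(
        w * loss
        for w_row, loss_row in zip(weights, per_token_losses)
        for w, loss in zip(w_row, loss_row)
    )
    return total / batch_size


# Batch of 2 samples, max_predictions_per_seq = 3, 3 real masked tokens in total.
flags = [[1, 1, 0], [1, 0, 0]]
# Made-up per-token losses; the 9.0 entries sit on padded slots and are ignored.
losses = [[2.0, 4.0, 9.0], [6.0, 9.0, 9.0]]
loss = scaled_mlm_loss(losses, mlm_weights(flags))
print(loss)  # approximately 4.0, the mean of 2.0, 4.0 and 6.0
```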

Configs included for this model

To train the model, you need to provide a YAML config file. Below is the list of YAML config files included for this model implementation in the configs folder. Feel free to create your own by following these examples:

  • bert_base_*.yaml have the standard bert-base config with hidden_size=768, num_hidden_layers=12, num_heads=12.
  • bert_large_*.yaml have the standard bert-large config with hidden_size=1024, num_hidden_layers=24, num_heads=16.
  • Files with an MSL*** substring, such as bert_base_MSL128.yaml, differ in maximum sequence length. Our example config files cover the popular sequence lengths 128, 512, and 1024.
  • Files like roberta_*.yaml are provided to run the RoBERTa model variants described in RoBERTa: A Robustly Optimized BERT Pretraining Approach.
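
When creating your own config, a quick sanity check is that hidden_size divides evenly across the attention heads. The sketch below applies that check to the two standard sizes listed above. The dictionary layout and the `head_dim` helper are illustrative assumptions and do not reflect the exact structure of the YAML files.

```python
MODEL_SIZES = {
    "bert_base": {"hidden_size": 768, "num_hidden_layers": 12, "num_heads": 12},
    "bert_large": {"hidden_size": 1024, "num_hidden_layers": 24, "num_heads": 16},
}


def head_dim(cfg):
    """Per-head width; hidden_size must divide evenly across attention heads."""
    if cfg["hidden_size"] % cfg["num_heads"]:
        raise ValueError("hidden_size must be divisible by num_heads")
    return cfg["hidden_size"] // cfg["num_heads"]


for name, cfg in MODEL_SIZES.items():
    print(name, head_dim(cfg))  # both sizes give 64
```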


Reference: BERT paper on BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Reference: RoBERTa paper on RoBERTa: A Robustly Optimized BERT Pretraining Approach.