Expanded README, added requirements.txt #40

Open · wants to merge 2 commits into base: master
88 changes: 83 additions & 5 deletions README.md
## `finetune-transformer-lm`: Code for "Improving Language Understanding by Generative Pre-Training"

This project contains the code and model for the paper ["Improving Language Understanding by Generative Pre-Training"](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf).

**Note: This project is no longer actively developed. The code is provided as-is, and no updates are expected.**

From the abstract:

> We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task... we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

The blog post describing this work is ["Improving Language Understanding with Unsupervised Learning"](https://blog.openai.com/language-unsupervised/).

Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

### License

This code is Copyright OpenAI and published under the MIT License.

### Requirements

This code is verified to run on Python 2.7 and 3.3.6 in a clean conda environment. It requires the following modules:

```
ftfy
joblib
numpy
pandas
sklearn
spacy
tensorflow
tqdm
```
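After installing, you can sanity-check that each module is importable. A minimal sketch (the module list mirrors the requirements above):

```python
import importlib.util

# Modules listed in requirements.txt (import names happen to match here)
required = ["ftfy", "joblib", "numpy", "pandas",
            "sklearn", "spacy", "tensorflow", "tqdm"]

# find_spec returns None when a module is not installed
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All required modules are installed.")
```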

### Setup

To install requirements, run:

```
pip install -r requirements.txt
```

#### Python 2.7

Note: `ftfy` dropped Python 2 support after version 4.4.3. This is handled in `requirements.txt` with environment markers that select a compatible version for each Python series.
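The environment markers in `requirements.txt` are evaluated by pip at install time. The same selection logic, sketched in plain Python for illustration:

```python
import sys

# requirements.txt pins ftfy per Python series using PEP 508 environment markers:
#   ftfy==4.4.3; python_version < '3.0'
#   ftfy>=5.0.0; python_version >= '3.0'
# pip evaluates the marker and installs only the matching line.
requirement = "ftfy==4.4.3" if sys.version_info[0] < 3 else "ftfy>=5.0.0"
print(requirement)
```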

#### spaCy model

You need to download the `en` model for spaCy:

```
python -m spacy download en
```

#### Data

You need to download all of the ROCStories 2016 datasets to train the model. Obtaining them requires filling out a short form so the data's creators can track who is using it; the datasets are available from the [ROCStories website](http://cs.rochester.edu/nlp/rocstories/). The location of the data is passed as a command-line argument.

Once you've downloaded them, the files should look something like this:

```
data/ROCStories__spring2016 - ROCStories_spring2016.csv
data/cloze_test_test__spring2016 - cloze_test_ALL_test.csv
data/cloze_test_test__spring2016 - cloze_test_ALL_test.tsv
data/cloze_test_val__spring2016 - cloze_test_ALL_val.csv
data/cloze_test_val__spring2016 - cloze_test_ALL_val.tsv
```
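Before training, it can help to confirm the expected files are in place. A minimal sketch (the filenames mirror the listing above; `data` is an assumed location):

```python
import os

# Expected ROCStories files, matching the layout shown in the README
EXPECTED_FILES = [
    "ROCStories__spring2016 - ROCStories_spring2016.csv",
    "cloze_test_test__spring2016 - cloze_test_ALL_test.csv",
    "cloze_test_test__spring2016 - cloze_test_ALL_test.tsv",
    "cloze_test_val__spring2016 - cloze_test_ALL_val.csv",
    "cloze_test_val__spring2016 - cloze_test_ALL_val.tsv",
]

def missing_files(data_dir):
    """Return the expected ROCStories files not found under data_dir."""
    return [f for f in EXPECTED_FILES
            if not os.path.isfile(os.path.join(data_dir, f))]

if __name__ == "__main__":
    missing = missing_files("data")
    if missing:
        print("Missing data files:")
        for f in missing:
            print("  " + f)
    else:
        print("All ROCStories files found.")
```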

#### Model

The model is precomputed and stored in the [`model`](model) directory.

### Training

Currently this code reproduces the ROCStories Cloze Test result reported in the paper. To train, run:

```
python train.py --dataset rocstories --desc rocstories --submit --analysis --data_dir [path to data here]
```
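For reference, flags like those in the command above can be parsed with `argparse`. This is a hypothetical sketch of the interface, not the repo's actual `train.py`, which defines additional hyperparameter flags:

```python
import argparse

# Sketch of the command-line interface shown above (illustrative only)
parser = argparse.ArgumentParser(description="Fine-tune the transformer LM")
parser.add_argument("--dataset", type=str, required=True)   # e.g. rocstories
parser.add_argument("--desc", type=str, required=True)      # run description
parser.add_argument("--submit", action="store_true")        # write predictions
parser.add_argument("--analysis", action="store_true")      # run analysis pass
parser.add_argument("--data_dir", type=str, default="data/")

args = parser.parse_args(
    ["--dataset", "rocstories", "--desc", "rocstories",
     "--submit", "--analysis", "--data_dir", "data/"])
print(args.dataset, args.data_dir)
```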

You can put the data files anywhere; just point the `--data_dir` value at their location.

Note: The code is currently non-deterministic due to various GPU ops. The median accuracy of 10 runs with this codebase (using default hyperparameters) is 85.8%, slightly lower than the single-run 86.5% reported in the paper.
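The median-of-10-runs figure can be computed from individual run accuracies with the standard library. A sketch with made-up run numbers (illustrative values only; the README reports a median of 85.8%):

```python
import statistics

# Hypothetical accuracies from 10 non-deterministic training runs
run_accuracies = [85.1, 85.4, 85.6, 85.7, 85.8, 85.8, 85.9, 86.0, 86.2, 86.5]

# With an even number of runs, the median is the mean of the two middle values.
median_acc = statistics.median(run_accuracies)
print("Median accuracy: %.1f%%" % median_acc)  # → 85.8%
```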

The ROCStories dataset can be downloaded from the associated [website](http://cs.rochester.edu/nlp/rocstories/).

### Testing

This code was tested by training the model under both Python 2.7 and Python 3 on Ubuntu Linux 17.10 (artful) with the 4.13.0-46-generic kernel. Each of the two Python processes consumed 24GB of RAM (12GB remained free) on a 12-core, 64GB-RAM machine with a GTX 1080, and used all 12 CPU cores in addition to the GPU. Training ran overnight.

Your results may vary.

11 changes: 11 additions & 0 deletions requirements.txt
ftfy==4.4.3; python_version < '3.0'
ftfy>=5.0.0; python_version >= '3.0'
joblib
numpy
pandas
sklearn
spacy
tensorflow
tqdm

Review comment on the `tensorflow` line: The code in this repo is not compatible with the latest TF, so some version restriction here is needed.