Kraken

Deep text/document classification for binary/multiclass/multilabel, single-sequence tasks. Based on Hedwig.

Corresponding author (feature requests, bug reports): Dainis Boumber, [email protected]

Note: additional documentation is present throughout the library, in both the hedwig and hedwig-data directories, in the form of README.md files.

This library consists of six big parts:

  • Hedwig, a deep text multi-label classification library extended by us
  • PyTorch Hub - BERT, GPT, GPT-2 and Transformer-XL as implemented in PyTorch Hub, plus many others
  • TextZoo - 20 or so models that can be considered classics
  • XLNet - defeater of BERT
  • FastAI - also known as ULMFiT, an AWD-LSTM and Transformer based library that, along with GPT and ELMo, was among the first to use language modeling for pre-training. It has a lot of other tricks and ideas, and gives BERT a run for its money with ease in many practical scenarios.
  • Cybertron - a monstrous distributed model; as if Transformer-XL and XLNet had a baby

Full list of models:

FastText
NBSVM

KimCNN
MultiLayerCNN
MultiPerspectiveCNN
InceptionCNN
BILSTM
CharCNN
HAN
StackLSTM
LSTM with Attention (Self Attention / Quantum Attention)
RCNN
C-LSTM
ConS2S
Capsule
QuantumNN
TextCNN
Reg-BiLSTM

QRNN
XML-CNN
GPT Transformer
ULMFit
BERT
FastBERT
DocBERT
Hierarchical BERT

Transformer-XL
XLNet
GPT-2

Cybertron

Setup

Hedwig was designed for Python 3.6 and PyTorch 0.4.1

PyTorch recommends Anaconda for managing your environment. We'd recommend creating a custom environment as follows:

$ conda create --name hedwig python=3.6
$ conda activate hedwig

And installing PyTorch as follows:

$ conda install pytorch=0.4.1 cuda92 -c pytorch

Other Python packages we use can be installed via pip:

$ pip install -r requirements.txt

The code depends on data from NLTK (e.g., stopwords), so you'll have to download it. Run the Python interpreter and type the following commands:

$ python
>>> import nltk
>>> nltk.download()
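
If you prefer a non-interactive download, you can fetch just the corpora the code needs, for example the stopwords:

>>> nltk.download('stopwords')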

Furthermore, you will want the spaCy models and stop words:

$ spacy download en

spaCy's stop words tend to be superior to those offered by NLTK, scikit-learn, or StanfordNLP, although this is task-specific. They are accessible like so:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

spacy_nlp = spacy.load('en_core_web_sm')
spacy_stopwords = STOP_WORDS
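
For example, to drop stop words from a document (a minimal sketch; Hedwig's own preprocessing lives under utils/):

doc = spacy_nlp("This is a short example sentence.")
content_tokens = [t.text for t in doc if not t.is_stop]  # keep only non-stop-word tokens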

Models

Each model directory has a README.md with further details. All models follow a similar training pattern (differences are explained in their specific README.md files). Training is simple. For example, if you are using XML-CNN on the MBTI dataset, you would run something similar to this:

python -m models.xml_cnn --mode non-static --dataset MBTI --batch-size 1024 --lr 0.01 --epochs 30 --dropout 0.5 --dynamic-pool-length 8 --seed 3435

These are of course sub-optimal hyperparameters, just an example. Better results can be achieved by tuning various knobs, for example:

python -m models.xml_cnn --dataset MBTI --mode rand --batch-size 512 --lr 0.002 --epoch-decay 2 --dev-every 2 --epochs 10 --dropout 0.33 --dynamic-pool-length 16 --seed 3435

The --mode hyper-parameter:

  • rand makes Hedwig train the embeddings from scratch
  • non-static makes Hedwig fine-tune existing pre-trained embeddings (typically ones you specify with this task in mind)
  • static runs on pre-trained embeddings without modifying them
  • multichannel (or leaving --mode unspecified for the CNN models) trains with two channels, one static and one fine-tuneable

The second example results in a micro-F1 of 0.76 vs. 0.72 for the first, and can be run on a smaller GPU thanks to the halved batch size. In general, a bigger batch size leads to smoother optimization, but will not always give the best results if the other parameters are not scaled accordingly. For further discussion on this topic, we refer you to the following publications:

  • A Disciplined Approach to Neural Network Hyper-Parameters by Leslie N. Smith
  • Don't Decay the Learning Rate, Increase the Batch Size by Samuel L. Smith et al.

For more information regarding each architecture, look inside the model's directory for a file called args.py that describes the parameters it takes, in addition to standard ones like learning rate and batch size, which are defined in models/args.py. Each model may have additional command-line arguments you can use; the examples above only show a few. XML-CNN alone, for example, has around two dozen things you can tune.
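
Assuming the parsers are built with argparse, as the shared models/args.py is (an illustration, not a guarantee for every model), you can list all available flags for a given model with:

$ python -m models.xml_cnn --help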

Datasets

Organize your directory structure as follows:

hedwig-anlp
           ├── hedwig
           └── hedwig-data

This is done for you in this repository already, but double-check as a sanity measure.

hedwig-data, complete with the default embeddings plus glove-twitter-200 embeddings, the default datasets, and an additional one called MBTI (for Marjan's paper, set up in Hedwig format and ready for use), can be found on the Backblaze account I set up a while ago -- search your e-mail for logon credentials or ask me. Ideally you want to store stuff there, since it takes seconds to upload/download a 20-30 GB CSV, whereas Google Drive sometimes has issues with that. Plus, it's 10 TB of free storage.

I have already set up access for big-box-1 and will follow up with the server and the other box. Use it like so:

$ b2

That will produce a list of commands and explanations. Most of the hedwig-related material is in the bucket called "marjan".

$ b2 download-file-by-name marjan hedwig-data.tar.gz

This will get you the hedwig-data directory.

$ b2 download-file-by-name marjan twitter.tar.gz

This will get you the MBTI data, which has been "twitterized", i.e. preprocessed in the same manner the Stanford NLP team preprocessed the glove-twitter-200 data.
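
To give a rough idea of what that preprocessing does (a sketch only; the authoritative implementation is utils/twitterize.py), the GloVe-Twitter conventions replace URLs, user mentions, hashtags and numbers with special tokens and lowercase the text:

import re

def twitterize(text):
    # Illustrative GloVe-Twitter-style substitutions; see utils/twitterize.py for the real thing.
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)       # URLs
    text = re.sub(r"@\w+", "<user>", text)                        # user mentions
    text = re.sub(r"#(\w+)", r"<hashtag> \1", text)               # hashtags
    text = re.sub(r"[-+]?\d+(?:[.,:]\d+)*", "<number>", text)     # numbers
    return text.lower()

print(twitterize("@nlp_fan check https://example.com #GloVe 2019"))
# -> '<user> check <url> <hashtag> glove <number>'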

An alternative to the above approach is to download the Reuters, AAPD and IMDB datasets, along with the word2vec embeddings, from the University of Waterloo's hedwig-data repository:

$ git clone https://git.uwaterloo.ca/jimmylin/hedwig-data.git

After cloning the hedwig-data repo, you need to unzip the embeddings and run the preprocessing script:

cd hedwig-data/embeddings/word2vec
gzip -d GoogleNews-vectors-negative300.bin.gz
python bin2txt.py GoogleNews-vectors-negative300.bin GoogleNews-vectors-negative300.txt

Adding new datasets

Summary: your dataset must conform to the specifications defined by torchtext.data and torchtext.datasets - see the torchtext documentation for a detailed guide.

  • Add a directory named after your dataset in hedwig-data/datasets/. Within it, you want to have three files: train.tsv, test.tsv, and dev.tsv.
  • Use the add_dataset.ipynb notebook, found in the hedwig/utils/ directory, to pre-process your Pandas dataframe into a TSV file that Hedwig can use.
  • Preprocessing is of course task-dependent; for examples, see the other datasets in the hedwig-data/datasets/ directory and the utils/ directory, namely utils/add_dataset.ipynb, utils/add_dataset.py, and utils/twitterize.py.
  • Add the code necessary to load, process, train and evaluate on your new dataset throughout the library. Small modifications may be needed in roughly 25% of the library, but they are really simple to make. See how MBTI was added for an example, and copy-paste (while changing relevant things like the number of labels, classes, etc.). Specifically, you will want to add things to datasets, each model's __main__.py and args.py files, and a few other places.

You want the data in TSV format. Make sure that your data does not contain " characters, escape characters, or invalid Unicode anywhere; that the label column is separated from the text by a tab; that neither label nor text is surrounded by quotation marks; and that there is only one \n per example, at the end of each line.

See utils/add_dataset.ipynb for how to do it if you encounter issues. Your preprocessing should, ideally, take into account the word embeddings used (by default, most models other than BERT take word2vec, which has very specific preprocessing rules).
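
As a minimal sketch of the kind of cleaning involved (the canonical version is utils/add_dataset.ipynb; the column and file names below are hypothetical):

import csv
import re
import pandas as pd

def to_hedwig_tsv(df, path, label_col='label', text_col='text'):
    # Write label<TAB>text lines, one example per line, with no quoting.
    clean = lambda s: re.sub(r'["\t\r\n]+', ' ', str(s)).strip()
    out = df[[label_col, text_col]].copy()
    out[label_col] = out[label_col].map(clean)
    out[text_col] = out[text_col].map(clean)
    out.to_csv(path, sep='\t', header=False, index=False, quoting=csv.QUOTE_NONE)

# df = pd.read_csv('mbti_raw.csv')  # hypothetical input file
# to_hedwig_tsv(df, 'hedwig-data/datasets/MBTI/train.tsv')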

If you want to add, change or remove metrics, see the hedwig/common/ directory.
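
For reference, micro-F1 of the sort reported above can be computed with scikit-learn over binary label indicator matrices (an illustration only; the actual implementations live in hedwig/common/):

import numpy as np
from sklearn.metrics import f1_score

# Rows are documents, columns are labels (multi-label indicator format).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print(f1_score(y_true, y_pred, average='micro'))  # 0.75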

Using different embeddings

Likewise, you can add different embeddings to hedwig-data/embeddings in the same manner. Just don't forget to tell the model what to use via the command-line args. Example using glove-twitter-200:

python -m models.xml_cnn --mode non-static  --dataset MBTI --batch-size 1024 --lr 0.002 --epochs 10 --dropout 0.7 --seed 3435 --word-vectors-dir ../hedwig-data/embeddings/glove-twitter-200 --word-vectors-file glove-twitter-200.txt --embed-dim 200 --words-dim 200 --weight-decay 0.0001

Recent additions:

  • MBTI Dataset and all the necessary modules
  • Utilities to preprocess and add new datasets saved from a regular Pandas dataframe
  • Tokenizer and preprocessor that follow the protocol used by the Stanford NLP team when making glove-twitter-200
  • Many bug fixes

TODO

  • Integrate preprocessing from torchtext and phase out the use of alternatives when possible
  • Support for PyTorch 1.0/1.1
  • Support for Python 3.7
  • Support for mixed precision training for all models
  • NBSVM

Coming Soon

  • Distributed training for models that need it
  • Dedicated embeddings module
  • More automation to dataset addition process
  • Several SOTA classifiers
