Welcome! This repo contains the files needed to train a spaCy text classification model on movie reviews. To get it working:
- Set up a Python environment with the following installed:
- spaCy
- numpy
- pandas
- scikit-learn
- tqdm
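To confirm the environment is ready before moving on, a small check like the one below can help. It is only a sketch: the helper name `missing_packages` is hypothetical, and the one detail worth noting is that scikit-learn is imported as `sklearn`.

```python
from importlib.util import find_spec

# Import names for the packages listed above.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = ["spacy", "numpy", "pandas", "sklearn", "tqdm"]


def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can be installed with pip before continuing.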
- Download the IMDB 50K Movie Reviews dataset from Kaggle.
- Place the CSV file in the root folder of this repo and rename it to:
IMDB Dataset.csv
- Preprocess the data with the script:
create_data.py
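For orientation, the preprocessing step can be sketched roughly as follows. This is not the repo's create_data.py, just a minimal illustration under stated assumptions: the CSV has `review` and `sentiment` columns with values `positive` and `negative`, the split is 80/10/10, and the output names `train.spacy` and `dev.spacy` are hypothetical (only `./data/test.spacy` appears in the commands in this README). It reads the file with the standard `csv` module rather than pandas to keep the sketch short.

```python
import csv
import random
from pathlib import Path


def load_reviews(csv_path):
    """Read (text, sentiment) pairs; assumes 'review' and 'sentiment' columns."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [(row["review"], row["sentiment"]) for row in csv.DictReader(f)]


def label_to_cats(sentiment):
    """Map a sentiment string to the per-label scores stored in doc.cats."""
    return {
        "positive": float(sentiment == "positive"),
        "negative": float(sentiment == "negative"),
    }


def split_rows(rows, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle and split rows into (train, dev, test). Fractions are assumptions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_dev = int(len(rows) * dev_frac)
    n_test = int(len(rows) * test_frac)
    dev = rows[:n_dev]
    test = rows[n_dev:n_dev + n_test]
    train = rows[n_dev + n_test:]
    return train, dev, test


def write_docbin(rows, out_path):
    """Serialize rows to a .spacy file with doc.cats set on each Doc."""
    # Imported here so the helpers above can be used without spaCy installed.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")  # blank English pipeline, used only for tokenization
    doc_bin = DocBin()
    for text, sentiment in rows:
        doc = nlp.make_doc(text)
        doc.cats = label_to_cats(sentiment)
        doc_bin.add(doc)
    doc_bin.to_disk(out_path)


def main():
    """Convert the renamed Kaggle CSV into three DocBin files under ./data."""
    train, dev, test = split_rows(load_reviews("IMDB Dataset.csv"))
    Path("data").mkdir(exist_ok=True)
    write_docbin(train, "data/train.spacy")  # hypothetical file name
    write_docbin(dev, "data/dev.spacy")      # hypothetical file name
    write_docbin(test, "data/test.spacy")
```

Calling `main()` from the repo root would produce the three files; the fixed seed keeps the split reproducible between runs.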
- Create the spaCy configuration:
- Use the quickstart widget at https://spacy.io/usage/training#quickstart to generate a base config for English with only textcat selected.
- Copy the configuration to a file called base_config.cfg in the root of the cloned repo.
- Edit the train and dev fields under [paths] in the config so that they point to the spaCy DocBin (.spacy) files generated by create_data.py in the preprocessing step.
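As an illustration of what the edited fields might look like, the sketch below parses a `[paths]` section and returns the two values. The file names `train.spacy` and `dev.spacy` are assumptions (use whatever create_data.py actually writes), `read_data_paths` is a hypothetical helper, and the standard-library `configparser` stands in here for spaCy's own config loader, which should be fine for a simple section like this one.

```python
from configparser import ConfigParser


def read_data_paths(config_text):
    """Return the train and dev values from the [paths] section, without quotes."""
    # interpolation=None leaves any special characters in values untouched.
    parser = ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return {key: parser["paths"][key].strip('"') for key in ("train", "dev")}


# What the edited section of base_config.cfg could look like (assumed file names).
example = """
[paths]
train = "./data/train.spacy"
dev = "./data/dev.spacy"
"""

print(read_data_paths(example))
```

To check the real file, pass its contents instead of `example`, for instance `read_data_paths(open("base_config.cfg", encoding="utf-8").read())`. If either value still comes back as `null`, the fields have not been edited yet.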
- Generate the full config, train, and evaluate the model:
# generate the full spaCy config file
python -m spacy init fill-config ./base_config.cfg ./config.cfg
# train the spaCy model and store it in the output dir
python -m spacy train config.cfg --output ./output
# evaluate the trained model on the test set
python -m spacy evaluate ./output/model-best/ ./data/test.spacy -o metrics.json
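Once the evaluate command has written metrics.json, the scores can be inspected with a few lines of Python. The sketch below makes no assumption about which keys the file contains: it simply lists every top-level numeric entry. The helper names `scalar_scores` and `print_scores` are hypothetical.

```python
import json


def scalar_scores(metrics):
    """Return the top-level numeric entries of a metrics dict, sorted by name."""
    return {
        key: value
        for key, value in sorted(metrics.items())
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def print_scores(path="metrics.json"):
    """Load the metrics file and print one 'name: value' line per numeric score."""
    with open(path, encoding="utf-8") as f:
        metrics = json.load(f)
    for key, value in scalar_scores(metrics).items():
        print(f"{key}: {value:.4f}")
```

Running `print_scores()` from the repo root prints the summary; nested entries such as per-label breakdowns are skipped and can be read directly from the JSON file.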