Skip to content
This repository has been archived by the owner on Apr 6, 2023. It is now read-only.

Latest commit

 

History

History
34 lines (26 loc) · 1.3 KB

README.md

File metadata and controls

34 lines (26 loc) · 1.3 KB

Movie Review Classification with spaCy

Welcome! This repo contains some files to get a spaCy model working for classifying movie reviews.

Prerequisites

  • A python env setup with the following installed:
    • spaCy
    • numpy
    • pandas
    • scikit-learn
    • tqdm

Setup

  1. Download data from Kaggle for the IMDB 50k Movie Reviews Dataset.
  2. Place the CSV file in the root folder of this repo and rename it IMDB Dataset.csv
  3. Preprocess the data with create_data.py
  4. Create the spaCy configuration:
    1. Generate a base config for English with only textcat selected from https://spacy.io/usage/training#quickstart
    2. Copy the configuration to a file called base_config in the root of the cloned repo.
    3. Edit the train data and dev data fields in the config so that they reference the spacy docbin files generated in step 3.
  5. Generate full config, train, and evaluate the model:
# generate the full spacy config file
python -m spacy init fill-config ./base_config.cfg ./config.cfg

# train spacy model and store in output dir
python -m spacy train config.cfg --output ./output 

# evaluate trained model
python -m spacy evaluate ./output/model-best/ ./data/test.spacy -o metrics.json