This repo contains:
- Tools for extracting toponyms (and lemmata) from newspaper articles downloaded from LexisNexis.
- The results collected with these tools for a study of toponyms in Brexit news coverage in Dutch newspapers.
- A short write-up on this case study. Check out the interactive map here.
There are three main scripts that were used to generate the data for this case study. Each script contains further documentation on how it should be used:
- Build NER model: Create a spaCy NER model for extracting toponyms
- Build data set: Extract text and metadata from LexisNexis files
- Extract toponyms: Apply the model to the data set and compute statistics from it
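The extraction step can be sketched roughly as follows. This is a stdlib-only stand-in that uses a toy gazetteer in place of the trained spaCy model; the place names, categories, and example text are illustrative assumptions, not taken from the repo:

```python
# Toy gazetteer standing in for the trained spaCy NER model;
# the real pipeline loads a model produced by the "Build NER model" script.
GAZETTEER = {
    "Brussel": "CITY",
    "Londen": "CITY",
    "Nederland": "COUNTRY",
}

def extract_toponyms(text):
    """Return (toponym, category) pairs found in a text."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

article = "De premier reisde van Londen naar Brussel."
print(extract_toponyms(article))
# → [('Londen', 'CITY'), ('Brussel', 'CITY')]
```

The real model additionally disambiguates and lemmatizes; this sketch only shows the overall shape of the step.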
The PhraseAnnotator in annotation_tools can be used to annotate the NER results.
This tool currently extracts two main statistics for each geographical category defined in the [MODEL] section of config.ini:
- Total frequency
- Article counts
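The difference between the two statistics can be sketched as follows; this is a minimal illustration assuming toponyms have already been extracted per article (the data shown are hypothetical, not results from the study):

```python
from collections import Counter

# Extracted toponyms per article (illustrative data only).
articles = [
    ["Brussel", "Londen", "Brussel"],
    ["Londen"],
    ["Brussel", "Nederland"],
]

# Total frequency: every mention counts.
total_frequency = Counter(t for article in articles for t in article)

# Article counts: each toponym counted at most once per article.
article_counts = Counter(t for article in articles for t in set(article))

print(total_frequency["Brussel"])  # → 3
print(article_counts["Brussel"])   # → 2
```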
These scripts generally store their results in Python's pickle format. To make the results of this study generally available, the following data have been added to the repo as CSV files (some have been zipped):
- The metadata for the LexisNexis dataset
- The statistics of the toponym recognition
- The statistics of the lemmata recognition
- The annotation data
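Converting pickled results to CSV could look roughly like this; a stdlib-only sketch where the pickled data structure and file names are assumptions, not necessarily what the repo's scripts produce:

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Illustrative statistics as the scripts might pickle them.
stats = {"Brussel": 3, "Londen": 2, "Nederland": 1}

workdir = Path(tempfile.mkdtemp())
pkl_path = workdir / "toponym_stats.pickle"
csv_path = workdir / "toponym_stats.csv"

# Store the results in pickle format, as the scripts do...
with pkl_path.open("wb") as f:
    pickle.dump(stats, f)

# ...then convert the pickle to a generally readable CSV file.
with pkl_path.open("rb") as f:
    loaded = pickle.load(f)

with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["toponym", "total_frequency"])
    writer.writerows(sorted(loaded.items()))

print(csv_path.read_text(encoding="utf-8"))
```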
The data and results have been made available through an online Jupyter notebook. Access the notebook by clicking this button: