Hansard Speaker Name Disambiguation

hansard-speakers is a data processing pipeline for disambiguating speaker names in the 19th-century British Parliamentary debates, also known as Hansard. The final dataset produced by this pipeline can be downloaded here (coming soon). An article describing our disambiguation efforts can be read here (coming soon). You can view the code to scrape and format our version of the Hansard corpus from the original XML files hosted by Historic Hansard.

Steps:

1. Clone the repo and cd into hansard-speakers.
2. Start the disambiguation process, either over a terminal or over SLURM.

Over terminal, compile the Cython modules and then launch the pipeline:

    cythonize -3 -i util/*.pyx
    python3 run.py --cores <n>

where <n> is the number of cores to use and must be at least three.

Over SLURM: sbatch job.sbatch
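
The terminal steps above can also be driven from a small Python script. The following is a minimal sketch, not part of the repository: the names build_commands, run_pipeline, and MIN_CORES are hypothetical, and it assumes it is run from the root of the cloned repo with cythonize on the PATH. It rejects a core count below three before anything is compiled, and it expands util/*.pyx itself because subprocess does not perform shell globbing.

```python
import glob
import subprocess
import sys

# Assumption taken from the steps above: run.py needs at least three cores.
MIN_CORES = 3


def build_commands(cores, pyx_pattern="util/*.pyx"):
    """Return the two terminal commands as argument lists (hypothetical helper)."""
    if cores < MIN_CORES:
        raise ValueError(f"--cores must be at least {MIN_CORES}, got {cores}")
    # subprocess does not expand shell globs, so expand the pattern here.
    pyx_files = sorted(glob.glob(pyx_pattern))
    build = ["cythonize", "-3", "-i", *pyx_files]
    run = [sys.executable, "run.py", "--cores", str(cores)]
    return build, run


def run_pipeline(cores):
    """Compile the Cython modules in place, then start the disambiguation."""
    build, run = build_commands(cores)
    subprocess.run(build, check=True)  # stops here if compilation fails
    subprocess.run(run, check=True)
```

Calling run_pipeline(4) from the repo root would then be equivalent to typing the two commands with --cores 4.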

Requirements:

Our disambiguation process compiles its lower-level routines (the .pyx files in util/) with Cython for computational speed and efficiency. To run hansard-speakers, users must have Cython installed as well as Python 3.
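
Before starting a run, it can help to confirm that these requirements are met. The sketch below is an illustration only and not part of the repository: check_environment is a hypothetical helper that reports whether Cython is importable and whether the machine exposes at least three cores, the minimum stated in the steps above.

```python
import importlib.util
import os


def check_environment(min_cores=3):
    """Return a list of problems found; an empty list means ready to run."""
    problems = []
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("Cython") is None:
        problems.append("Cython is not installed")
    available = os.cpu_count()  # None if the count cannot be determined
    if available is not None and available < min_cores:
        problems.append(
            f"only {available} core(s) available, need at least {min_cores}"
        )
    return problems


for problem in check_environment():
    print("Requirement not met:", problem)
```

An empty result only means the checks in this sketch passed; it does not verify the input data or the SLURM configuration.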
