Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models
This repository contains the code and data for the Scientometrics paper:
Zeng, T., Acuna, D.E. (2020), Modeling citation worthiness by using attention-based Bidirectional Long Short-Term Memory networks and interpretable models, Scientometrics, 124(1), 399–428
ACL-ARC dataset: please refer to Bonab et al., 2018 for details. We downloaded a copy of the dataset and adjusted some fields. You can download it from Figshare: 10.6084/m9.figshare.12573872.
PMOA-CITE dataset: please download 1M sentences from Figshare: 10.6084/m9.figshare.12547574
PMOA-CITE and ACL-ARC combined: please download it from Figshare: 10.6084/m9.figshare.12573974
The code requires the following packages:
- allennlp==0.9.0
- scikit-learn==0.21.2
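As a quick sanity check before running experiments, the pinned versions above can be compared against what is installed. The following is only a minimal sketch: the helper names `find_mismatches` and `installed_versions` are hypothetical and not part of this repository.

```python
# Sketch: compare installed package versions against the pins listed above.
try:
    from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
except ImportError:
    # Fallback for older interpreters, using setuptools' pkg_resources.
    from pkg_resources import get_distribution
    from pkg_resources import DistributionNotFound as PackageNotFoundError

    def version(name):
        return get_distribution(name).version

# Pins taken from the requirements list in this README.
REQUIRED = {"allennlp": "0.9.0", "scikit-learn": "0.21.2"}


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(installed, required=REQUIRED):
    """Return {package: (required, installed)} for every package that differs."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in required.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    for name, (wanted, got) in find_mismatches(installed_versions(REQUIRED)).items():
        print(f"{name}: required {wanted}, found {got}")
```

An empty result from `find_mismatches` means both pinned packages are present at the expected versions.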
All experiment configuration files are located in the cite-worthiness/experiments folder. To run an experiment:
- In each jsonnet file, find the fields train_data_path, validation_data_path and test_data_path, and set each one to the path where you stored the corresponding dataset mentioned above.
- Find the cuda_device field and set it to -1 if you are using a CPU, or to the CUDA device number if you are using a GPU.
- Run the command:
allennlp train /path/to/experiment/configuration/jsonnet/file -s ../path/to/serialization/dir/ --include-package citation_worthiness
Please refer to the AllenNLP documentation for details on the train command.
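As an alternative to editing each jsonnet file by hand, the same fields can be supplied at launch time through the `--overrides` option of `allennlp train`, which accepts a JSON string. The sketch below only assembles the command. The helper `build_train_command` and the data paths are hypothetical, and it assumes that `cuda_device` sits under the `trainer` section of the configuration, which should be checked against the actual experiment files.

```python
import json
import shlex


def build_train_command(config_path, serialization_dir, train_path,
                        validation_path, test_path, cuda_device=-1):
    """Assemble an `allennlp train` command that overrides data paths and device."""
    overrides = {
        "train_data_path": train_path,
        "validation_data_path": validation_path,
        "test_data_path": test_path,
        # Assumption: cuda_device is a key of the "trainer" section.
        "trainer": {"cuda_device": cuda_device},
    }
    return [
        "allennlp", "train", config_path,
        "-s", serialization_dir,
        "--include-package", "citation_worthiness",
        "--overrides", json.dumps(overrides),
    ]


command = build_train_command(
    config_path="/path/to/experiment/configuration/jsonnet/file",
    serialization_dir="/path/to/serialization/dir/",
    train_path="/path/to/train/data",            # hypothetical placeholder
    validation_path="/path/to/validation/data",  # hypothetical placeholder
    test_path="/path/to/test/data",              # hypothetical placeholder
    cuda_device=-1,                              # -1 selects the CPU
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

To launch training from Python, the list can be passed directly to `subprocess.run(command, check=True)`. Keeping the overrides in one dictionary leaves the configuration files in the repository untouched.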
Please visit our live demo at https://cite-worthiness.scienceofscience.org/. Enter some sentences, and the tool will predict the probability that each sentence needs a citation.
If you use the datasets or code in this repo, please cite our work: Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models.
@Article{Zeng2020,
author={Zeng, Tong and Acuna, Daniel E.},
title={Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models},
journal={Scientometrics},
year={2020},
month={Jul},
day={01},
volume={124},
number={1},
pages={399-428},
issn={1588-2861},
publisher = {Springer International Publishing},
doi={10.1007/s11192-020-03421-9},
url={https://doi.org/10.1007/s11192-020-03421-9}
}
The datasets and code were developed in the Science of Science and Computational Discovery Lab at the School of Information Studies, Syracuse University.
The code in this repo uses the MIT license.