[WIP] adapt classifier to use text and reference fraction data #17

Open · wants to merge 2 commits into base: master
6 changes: 3 additions & 3 deletions docs/generate_training_data.rst
@@ -52,7 +52,7 @@ Here, we consider only data from 2016 onwards since before that the curation rul
Generate Rejected data
^^^^^^^^^^^^^^^^^^^^^^

-The data for Rejected articles is harvested from the local inspire-next instance in a hackish way. The workflows themselves need to be modified in our local inspire-next setup. First, the file *inspire-next/inspirehep/modules/workflows/workflows/article.py* needs to be modified as specified in ``article.py``. We need to add another file *inspire-next/inspirehep/modules/workflows/tasks/makejson.py* with the contents of ``makejson.py``.
+The data for Rejected articles is harvested from the local inspire-next instance in a hackish way. The workflows themselves need to be modified in our local inspire-next setup. First, we need to get the Core and Non-Core record ids. We can get them by running the script in ``get_core_and_non_core_recids.py`` from the inspirehep shell [1]_. This will produce two files: ``inspire_core_recids.txt`` and ``inspire_noncore_recids.txt``. Next, the file *inspire-next/inspirehep/modules/workflows/workflows/article.py* needs to be modified as specified in ``article.py``. We need to add another file *inspire-next/inspirehep/modules/workflows/tasks/makejson.py* with the contents of ``makejson.py``.
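For illustration, running it could look like the following session (a sketch; the checkout paths and the way the snippet is loaded into the shell are assumptions about the local setup):

$ cd ~/inspire-next
$ inspirehep shell
>>> exec(open('/path/to/inspire-classifier/generate_training_data_scripts/get_core_and_non_core_recids.py').read())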

Once the workflow has been modified, we are ready to start the harvest. First, we need to deploy the harvest spiders. This can be done from the *inspire-next* instance folder:

@@ -97,14 +97,14 @@ This will open our favorite text editor (or we'll be required to set it). Add th

This will schedule a task to run every 15 minutes which will find and delete all files created more than 30 minutes ago. It's recommended to schedule the cronjob after starting the harvests since the first harvests and workflows can take a few minutes to start. We can schedule the command to run more or less frequently depending on our hardware specifications.
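For illustration only, such a cron entry could look like this (the location of the harvested files is an assumption about the local setup):

*/15 * * * * find /tmp/inspire-harvest -type f -mmin +30 -delete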

-The harvest produces a file named *inspire_harvested_data.json*. We can monitor the harvest status in the local holdingpen. However, it doesn't contain information on whether the harvested records were Core, Non-Core, or Rejected. To find this, we need to extract the list of arXiv identifiers of Core and Non-Core records from our local inspire-next instance. From the *inspirehep shell* [1]_, copy the contents of ``get_core_and_noncore_arXiv_identifiers.py`` and execute. This will produce two files, *inspire_core_list.txt* and *inspire_noncore_list.txt*. These files will be used to filter out Core and Non-Core records from the harvested data.
+The harvest produces a file named *inspire_harvested_data.json*. We can monitor the harvest status in the local holdingpen. However, it doesn't contain information on whether the harvested records were Core, Non-Core, or Rejected. To find this, we need to extract the list of arXiv identifiers of Core and Non-Core records from our local inspire-next instance. From the *inspirehep shell* [1]_, copy the contents of ``get_core_and_noncore_arXiv_identifiers.py`` and execute. This will produce two files, *inspire_core_arxiv_ids.txt* and *inspire_noncore_arxiv_ids.txt*. These files will be used to filter out Core and Non-Core records from the harvested data.
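Conceptually, the Rejected set is whatever the harvest produced that matches neither list. A minimal sketch of that filtering (the ``arxiv_id`` field name and one-JSON-record-per-line layout are assumptions for illustration, not taken from the actual scripts):

import json

with open('inspire_core_arxiv_ids.txt') as fd:
    known_arxiv_ids = set(line.strip() for line in fd)
with open('inspire_noncore_arxiv_ids.txt') as fd:
    known_arxiv_ids.update(line.strip() for line in fd)

rejected = []
with open('inspire_harvested_data.json') as fd:
    for line in fd:
        record = json.loads(line)
        if record.get('arxiv_id') not in known_arxiv_ids:
            rejected.append(record)  # matches neither list: treat as Rejected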

Combine the Core, Non-Core, and Rejected data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Core, Non-Core, and Rejected data can be combined using the python script found at ``combine_core_noncore_rejected_data.py``. The required file paths need to be specified in the script before running it. Finally, this will produce the file *inspire_data.df*, a pickled Pandas DataFrame that can be used for training and evaluation of the INSPIRE classifier. This file should be placed at the path specified in *inspire-classifier/inspire_classifier/config.py* in the variable *CLASSIFIER_DATAFRAME_PATH*.

-The resulting pandas dataframe will contain 2 columns: *labels* and *text* where *text* is *title* and *abstract* concatenated with a *<ENDTITLE>* token in between.
+The resulting pandas dataframe will contain 8 columns: *core_references_fraction_first_order*, *core_references_fraction_second_order*, *noncore_references_fraction_first_order*, *noncore_references_fraction_second_order*, *total_first_order_references*, *total_second_order_references*, *labels*, and *text*, where *text* is the *title* and *abstract* concatenated with an *<ENDTITLE>* token in between.
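As a quick sanity check after running the combine script (a sketch, assuming *inspire_data.df* is in the working directory):

import pandas as pd

inspire_data = pd.read_pickle('inspire_data.df')
print(inspire_data.columns.tolist())
# the six reference-fraction/count columns plus 'labels' and 'text'
print(inspire_data['text'].iloc[0])
# e.g. "Some title <ENDTITLE> Some abstract ..."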



generate_training_data_scripts/combine_core_noncore_rejected_data.py
@@ -24,16 +24,16 @@
import numpy as np
import pandas as pd

-inspire_core_list_path = 'inspire_core_list.txt'
-inspire_noncore_list_path = 'inspire_noncore_list.txt'
+inspire_core_arxiv_ids_path = 'inspire_core_arxiv_ids.txt'
+inspire_noncore_arxiv_ids_path = 'inspire_noncore_arxiv_ids.txt'
inspire_harvested_data_path = 'inspire_harvested_data.jsonl'
inspire_core_data_path = 'inspire_core_records.jsonl'
inspire_noncore_data_path = 'inspire_noncore_records.jsonl'
save_path = 'inspire_data.df'

-with open(inspire_core_list_path, 'r') as fd:
+with open(inspire_core_arxiv_ids_path, 'r') as fd:
    inspire_core_arxiv_ids = set(arxiv_id.strip() for arxiv_id in fd.readlines())
-with open(inspire_noncore_list_path, 'r') as fd:
+with open(inspire_noncore_arxiv_ids_path, 'r') as fd:
    inspire_noncore_arxiv_ids = set(arxiv_id.strip() for arxiv_id in fd.readlines())

def rejected_data(harvested_data_path):
@@ -67,5 +67,5 @@ def noncore_data():

inspire_data = pd.concat([rejected_df, noncore_df, core_df], ignore_index=True)
inspire_data['text'] = inspire_data['title'] + ' <ENDTITLE> ' + inspire_data['abstract']
-inspire_data = inspire_data[['labels', 'text']]
+inspire_data = inspire_data.drop(['title', 'abstract'], axis=1)
inspire_data.to_pickle(save_path)
101 changes: 93 additions & 8 deletions generate_training_data_scripts/generate_core_and_noncore_data.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of INSPIRE.
-# Copyright (C) 2014-2018 CERN.
+# Copyright (C) 2014-2019 CERN.
#
# INSPIRE is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -20,12 +20,14 @@
# granted to it by virtue of its status as an Intergovernmental Organization
# or submit itself to any jurisdiction.

"""
Get Core and Non-Core records starting from an earliest date from INSPIRE.
Please run the code in this snippet from within the inspirehep shell.
"""
from __future__ import absolute_import, division, print_function

import datetime
from inspire_dojson.utils import get_recid_from_ref
from inspirehep.utils.record_getter import (
    get_db_record,
    RecordGetterError
)
from invenio_db import db
from invenio_records.models import RecordMetadata
import json
@@ -40,8 +42,75 @@


STARTING_DATE = datetime.datetime(2016, 1, 1, 0, 0, 0)
inspire_core_recids_path = 'inspire_core_recids.txt'
inspire_noncore_recids_path = 'inspire_noncore_recids.txt'


with open(inspire_core_recids_path, 'r') as fd:
    core_recids = set(int(recid.strip()) for recid in fd.readlines())
with open(inspire_noncore_recids_path, 'r') as fd:
    noncore_recids = set(int(recid.strip()) for recid in fd.readlines())


def get_first_order_core_noncore_reference_fractions(references):
    # Count direct (first-order) references that resolve to Core or Non-Core
    # records and return the fractions over all references.
    num_core_refs = 0
    num_noncore_refs = 0
    if references:
        for reference in references:
            recid = get_recid_from_ref(reference.get('record'))
            if recid in core_recids:
                num_core_refs += 1
            elif recid in noncore_recids:
                num_noncore_refs += 1
        total_first_order_references = len(references)
        core_references_fraction = num_core_refs / total_first_order_references
        noncore_references_fraction = num_noncore_refs / total_first_order_references
    else:
        core_references_fraction, noncore_references_fraction = 0.0, 0.0
        total_first_order_references = 0

    return core_references_fraction, noncore_references_fraction, total_first_order_references
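# Editor's sketch of a call, with invented recids, assuming 123456 is in
# core_recids, 654321 is in noncore_recids, and get_recid_from_ref resolves
# a JSON reference like {'$ref': '.../api/literature/123456'} to 123456:
#   refs = [{'record': {'$ref': 'http://localhost:5000/api/literature/123456'}},
#           {'record': {'$ref': 'http://localhost:5000/api/literature/654321'}}]
#   get_first_order_core_noncore_reference_fractions(refs)  # -> (0.5, 0.5, 2)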


def get_second_order_core_noncore_reference_fractions(references):
    # Walk the references of each referenced record (the second-order
    # references) and compute the Core/Non-Core fractions over all of them.
    num_core_refs = 0
    num_noncore_refs = 0
    total_second_order_references = 0
    first_order_recids = get_references_recids(references)
    missing_recids = set()
    if first_order_recids:
        for f_recid in first_order_recids:
            if f_recid not in missing_recids:
                try:
                    second_order_references = get_db_record('lit', f_recid).get('references')
                except RecordGetterError:
                    missing_recids.add(f_recid)
                    continue
                if second_order_references:
                    total_second_order_references += len(second_order_references)
                    second_order_recids = get_references_recids(second_order_references)
                    for s_recid in second_order_recids:
                        if s_recid in core_recids:
                            num_core_refs += 1
                        elif s_recid in noncore_recids:
                            num_noncore_refs += 1
    if total_second_order_references > 0:
        core_references_fraction = num_core_refs / total_second_order_references
        noncore_references_fraction = num_noncore_refs / total_second_order_references
    else:
        core_references_fraction, noncore_references_fraction = 0.0, 0.0

    return core_references_fraction, noncore_references_fraction, total_second_order_references
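# Editor's note: this fetches each referenced record from the local database
# and scans its own references, so the fractions are computed over all
# second-order references; records that cannot be fetched (RecordGetterError)
# are remembered in missing_recids and skipped on repeat occurrences.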


def get_references_recids(references):
    # Extract the recids of all references that actually link to a record.
    recids = None
    if references:
        recids = [get_recid_from_ref(reference.get('record'))
                  for reference in references if reference.get('record')]
    return recids
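# Editor's sketch (invented input): references without a 'record' link are
# dropped, so get_references_recids([{'record': {'$ref': '.../literature/42'}}, {}])
# returns [42].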

-base_query = db.session.query(RecordMetadata).with_entities(RecordMetadata.json['titles'][0]['title'], RecordMetadata.json['abstracts'][0]['value'])
+base_query = db.session.query(RecordMetadata).with_entities(RecordMetadata.json['titles'][0]['title'], RecordMetadata.json['abstracts'][0]['value'], RecordMetadata.json['references'])
filter_by_date = RecordMetadata.created >= STARTING_DATE
has_title_and_abstract = and_(type_coerce(RecordMetadata.json, JSONB).has_key('titles'), type_coerce(RecordMetadata.json, JSONB).has_key('abstracts'))
filter_deleted_records = or_(not_(type_coerce(RecordMetadata.json, JSONB).has_key('deleted')), not_(RecordMetadata.json['deleted'] == cast(True, JSONB)))
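# Editor's note: these filters keep records created on or after STARTING_DATE
# that have both a title and an abstract and are not marked as deleted;
# combined with only_literature_collection below, the queries are restricted
# to Literature records.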
@@ -54,14 +123,30 @@
noncore_query_results = base_query.filter(filter_by_date, only_noncore_records, has_title_and_abstract, filter_deleted_records, only_literature_collection)

with open('inspire_core_records.jsonl', 'w') as fd:
-    for title, abstract in core_query_results:
+    for title, abstract, references in core_query_results:
        core_references_fraction_first_order, noncore_references_fraction_first_order, total_first_order_references = get_first_order_core_noncore_reference_fractions(references)
        core_references_fraction_second_order, noncore_references_fraction_second_order, total_second_order_references = get_second_order_core_noncore_reference_fractions(references)
        fd.write(json.dumps({
            'title': title,
            'abstract': abstract,
            'core_references_fraction_first_order': core_references_fraction_first_order,
            'noncore_references_fraction_first_order': noncore_references_fraction_first_order,
            'core_references_fraction_second_order': core_references_fraction_second_order,
            'noncore_references_fraction_second_order': noncore_references_fraction_second_order,
            'total_first_order_references': total_first_order_references,
            'total_second_order_references': total_second_order_references,
        }) + '\n')
with open('inspire_noncore_records.jsonl', 'w') as fd:
-    for title, abstract in noncore_query_results:
+    for title, abstract, references in noncore_query_results:
        core_references_fraction_first_order, noncore_references_fraction_first_order, total_first_order_references = get_first_order_core_noncore_reference_fractions(references)
        core_references_fraction_second_order, noncore_references_fraction_second_order, total_second_order_references = get_second_order_core_noncore_reference_fractions(references)
        fd.write(json.dumps({
            'title': title,
            'abstract': abstract,
            'core_references_fraction_first_order': core_references_fraction_first_order,
            'noncore_references_fraction_first_order': noncore_references_fraction_first_order,
            'core_references_fraction_second_order': core_references_fraction_second_order,
            'noncore_references_fraction_second_order': noncore_references_fraction_second_order,
            'total_first_order_references': total_first_order_references,
            'total_second_order_references': total_second_order_references,
        }) + '\n')
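Each line of the resulting *inspire_core_records.jsonl* (and of its Non-Core counterpart) is then a JSON object of the following shape, with invented values for illustration:

{"title": "...", "abstract": "...", "core_references_fraction_first_order": 0.4, "noncore_references_fraction_first_order": 0.2, "core_references_fraction_second_order": 0.35, "noncore_references_fraction_second_order": 0.25, "total_first_order_references": 25, "total_second_order_references": 410}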
generate_training_data_scripts/get_core_and_noncore_arXiv_identifiers.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of INSPIRE.
-# Copyright (C) 2014-2018 CERN.
+# Copyright (C) 2014-2019 CERN.
#
# INSPIRE is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -25,13 +25,16 @@
Please run the code in this snippet from within the inspirehep shell.
"""

from __future__ import absolute_import, division, print_function

from invenio_search import current_search_client as es
from elasticsearch.helpers import scan
import numpy as np


core = []
non_core = []


for hit in scan(es, query={"query": {"exists": {"field": "arxiv_eprints"}}, "_source": ["core", "arxiv_eprints"]},
                index='records-hep', doc_type='hep'):
    source = hit['_source']
@@ -41,7 +44,7 @@
    else:
        non_core.append(arxiv_eprint)

-with open('inspire_core_list.txt', 'w') as fd:
+with open('inspire_core_arxiv_ids.txt', 'w') as fd:
    fd.writelines("{}\n".format(arxiv_id) for arxiv_id in core)
-with open('inspire_noncore_list.txt', 'w') as fd:
-    fd.writelines("{}\n".format(arxiv_id) for arxiv_id in non_core)
+with open('inspire_noncore_arxiv_ids.txt', 'w') as fd:
+    fd.writelines("{}\n".format(arxiv_id) for arxiv_id in non_core)
45 changes: 45 additions & 0 deletions generate_training_data_scripts/get_core_and_non_core_recids.py
@@ -0,0 +1,45 @@
# -*- coding: utf-8 -*-
#
# This file is part of INSPIRE.
# Copyright (C) 2014-2019 CERN.
#
# INSPIRE is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# INSPIRE is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with INSPIRE. If not, see <http://www.gnu.org/licenses/>.
#
# In applying this license, CERN does not waive the privileges and immunities
# granted to it by virtue of its status as an Intergovernmental Organization
# or submit itself to any jurisdiction.

from __future__ import absolute_import, division, print_function

from invenio_search import current_search_client as es
from elasticsearch.helpers import scan


core = []
non_core = []

for hit in scan(es, query={"query": {"exists": {"field": "control_number"}}, "_source": ["core", "control_number"]},
                index='records-hep', doc_type='hep'):
    source = hit['_source']
    control_number = source['control_number']
    if source.get('core'):
        core.append(control_number)
    else:
        non_core.append(control_number)

with open('inspire_core_recids.txt', 'w') as fd:
    fd.writelines("{}\n".format(recid) for recid in core)
with open('inspire_noncore_recids.txt', 'w') as fd:
    fd.writelines("{}\n".format(recid) for recid in non_core)
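# Editor's note: both output files contain one record id per line, which is
# exactly what generate_core_and_noncore_data.py reads back with
# int(recid.strip()).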
