[WIP] adapt classifier to use text and reference fraction data #17

Open · wants to merge 2 commits into base: master
6 changes: 3 additions & 3 deletions docs/generate_training_data.rst
@@ -52,7 +52,7 @@ Here, we consider only data from 2016 onwards since before that the curation rul
Generate Rejected data
^^^^^^^^^^^^^^^^^^^^^^

-The data for Rejected articles is harvested from the local inspire-next instance in a hackish way. The workflows themselves need to be modified in our local inspire-next setup. First, the file *inspire-next/inspirehep/modules/workflows/workflows/article.py* needs to be modified as specified in ``article.py``. We need to add another file *inspire-next/inspirehep/modules/workflows/tasks/makejson.py* with the contents of ``makejson.py``.
+The data for Rejected articles is harvested from the local inspire-next instance in a hackish way. The workflows themselves need to be modified in our local inspire-next setup. First, we need to get the Core and Non-Core record ids. We can get them by running the script in ``get_core_and_non_core_recids.py`` from the inspirehep shell [1]_. This will produce two files: ``inspire_core_recids.txt`` and ``inspire_noncore_recids.txt``. Next, the file *inspire-next/inspirehep/modules/workflows/workflows/article.py* needs to be modified as specified in ``article.py``. We need to add another file *inspire-next/inspirehep/modules/workflows/tasks/makejson.py* with the contents of ``makejson.py``.
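For illustration, running it could look like the following session (a sketch; the checkout paths and the way the snippet is loaded into the shell are assumptions about the local setup):

$ cd ~/inspire-next
$ inspirehep shell
>>> exec(open('/path/to/inspire-classifier/generate_training_data_scripts/get_core_and_non_core_recids.py').read())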

Once the workflow has been modified, we are ready to start the harvest. First, we need to deploy the harvest spiders. This can be done from the *inspire-next* instance folder:

@@ -97,14 +97,14 @@ This will open our favorite text editor (or we'll be required to set it). Add th

This will schedule a task to run every 15 minutes which will find and delete all files created more than 30 minutes ago. It's recommended to schedule the cronjob after starting the harvests since the first harvests and workflows can take a few minutes to start. We can schedule the command to run more or less frequently depending on our hardware specifications.
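For illustration only, such a cron entry could look like this (the location of the harvested files is an assumption about the local setup):

*/15 * * * * find /tmp/inspire-harvest -type f -mmin +30 -delete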

-The harvest produces a file named *inspire_harvested_data.json*. We can monitor the harvest status in the local holdingpen. However, it doesn't contain information on whether the harvested records were Core, Non-Core, or Rejected. To find this, we need to extract the list of arXiv identifiers of Core and Non-Core records from our local inspire-next instance. From the *inspirehep shell* [1]_, copy the contents of ``get_core_and_noncore_arXiv_identifiers.py`` and execute. This will produce two files, *inspire_core_list.txt* and *inspire_noncore_list.txt*. These files will be used to filter out Core and Non-Core records from the harvested data.
+The harvest produces a file named *inspire_harvested_data.json*. We can monitor the harvest status in the local holdingpen. However, it doesn't contain information on whether the harvested records were Core, Non-Core, or Rejected. To find this, we need to extract the list of arXiv identifiers of Core and Non-Core records from our local inspire-next instance. From the *inspirehep shell* [1]_, copy the contents of ``get_core_and_noncore_arXiv_identifiers.py`` and execute. This will produce two files, *inspire_core_arxiv_ids.txt* and *inspire_noncore_arxiv_ids.txt*. These files will be used to filter out Core and Non-Core records from the harvested data.
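Conceptually, the Rejected set is whatever the harvest produced that matches neither list. A minimal sketch of that filtering (the ``arxiv_id`` field name and one-JSON-record-per-line layout are assumptions for illustration, not taken from the actual scripts):

import json

with open('inspire_core_arxiv_ids.txt') as fd:
    known_arxiv_ids = set(line.strip() for line in fd)
with open('inspire_noncore_arxiv_ids.txt') as fd:
    known_arxiv_ids.update(line.strip() for line in fd)

rejected = []
with open('inspire_harvested_data.json') as fd:
    for line in fd:
        record = json.loads(line)
        if record.get('arxiv_id') not in known_arxiv_ids:
            rejected.append(record)  # matches neither list: treat as Rejected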

Combine the Core, Non-Core, and Rejected data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Core, Non-Core, and Rejected data can be combined using the python script found at ``combine_core_noncore_rejected_data.py``. The required file paths need to be specified in the script before running it. Finally, this will produce the file *inspire_data.df*, a pickled Pandas DataFrame that can be used for training and evaluation of the INSPIRE classifier. This file should be placed at the path specified in *inspire-classifier/inspire_classifier/config.py* in the variable *CLASSIFIER_DATAFRAME_PATH*.

-The resulting pandas dataframe will contain 2 columns: *labels* and *text* where *text* is *title* and *abstract* concatenated with a *<ENDTITLE>* token in between.
+The resulting pandas dataframe will contain 8 columns: *core_references_fraction_first_order*, *core_references_fraction_second_order*, *noncore_references_fraction_first_order*, *noncore_references_fraction_second_order*, *total_first_order_references*, *total_second_order_references*, *labels*, and *text*, where *text* is the *title* and *abstract* concatenated with an *<ENDTITLE>* token in between.
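As a quick sanity check after running the combine script (a sketch, assuming *inspire_data.df* is in the working directory):

import pandas as pd

inspire_data = pd.read_pickle('inspire_data.df')
print(inspire_data.columns.tolist())
# the six reference-fraction/count columns plus 'labels' and 'text'
print(inspire_data['text'].iloc[0])
# e.g. "Some title <ENDTITLE> Some abstract ..."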



generate_training_data_scripts/combine_core_noncore_rejected_data.py
@@ -24,16 +24,16 @@
import numpy as np
import pandas as pd

-inspire_core_list_path = 'inspire_core_list.txt'
-inspire_noncore_list_path = 'inspire_noncore_list.txt'
+inspire_core_arxiv_ids_path = 'inspire_core_arxiv_ids.txt'
+inspire_noncore_arxiv_ids_path = 'inspire_noncore_arxiv_ids.txt'
inspire_harvested_data_path = 'inspire_harvested_data.jsonl'
inspire_core_data_path = 'inspire_core_records.jsonl'
inspire_noncore_data_path = 'inspire_noncore_records.jsonl'
save_path = 'inspire_data.df'

-with open(inspire_core_list_path, 'r') as fd:
+with open(inspire_core_arxiv_ids_path, 'r') as fd:
    inspire_core_arxiv_ids = set(arxiv_id.strip() for arxiv_id in fd.readlines())
-with open(inspire_noncore_list_path, 'r') as fd:
+with open(inspire_noncore_arxiv_ids_path, 'r') as fd:
    inspire_noncore_arxiv_ids = set(arxiv_id.strip() for arxiv_id in fd.readlines())

def rejected_data(harvested_data_path):
@@ -67,5 +67,5 @@ def noncore_data():

inspire_data = pd.concat([rejected_df, noncore_df, core_df], ignore_index=True)
inspire_data['text'] = inspire_data['title'] + ' <ENDTITLE> ' + inspire_data['abstract']
-inspire_data = inspire_data[['labels', 'text']]
+inspire_data = inspire_data.drop(['title', 'abstract'], axis=1)
inspire_data.to_pickle(save_path)
101 changes: 93 additions & 8 deletions generate_training_data_scripts/generate_core_and_noncore_data.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of INSPIRE.
-# Copyright (C) 2014-2018 CERN.
+# Copyright (C) 2014-2019 CERN.
#
# INSPIRE is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -20,12 +20,14 @@
# granted to it by virtue of its status as an Intergovernmental Organization
# or submit itself to any jurisdiction.

"""
Get Core and Non-Core records starting from an earliest date from INSPIRE.
Please run the code in this snippet from within the inspirehep shell.
"""
from __future__ import absolute_import, division, print_function

import datetime
from inspire_dojson.utils import get_recid_from_ref
from inspirehep.utils.record_getter import (
    get_db_record,
    RecordGetterError
)
from invenio_db import db
from invenio_records.models import RecordMetadata
import json
@@ -40,8 +42,75 @@


STARTING_DATE = datetime.datetime(2016, 1, 1, 0, 0, 0)
inspire_core_recids_path = 'inspire_core_recids.txt'
inspire_noncore_recids_path = 'inspire_noncore_recids.txt'


with open(inspire_core_recids_path, 'r') as fd:
    core_recids = set(int(recid.strip()) for recid in fd.readlines())
with open(inspire_noncore_recids_path, 'r') as fd:
    noncore_recids = set(int(recid.strip()) for recid in fd.readlines())


def get_first_order_core_noncore_reference_fractions(references):
    # Count direct (first-order) references that resolve to Core or Non-Core
    # records and return the fractions over all references.
    num_core_refs = 0
    num_noncore_refs = 0
    if references:
        for reference in references:
            recid = get_recid_from_ref(reference.get('record'))
            if recid in core_recids:
                num_core_refs += 1
            elif recid in noncore_recids:
                num_noncore_refs += 1
        total_first_order_references = len(references)
        core_references_fraction = num_core_refs / total_first_order_references
        noncore_references_fraction = num_noncore_refs / total_first_order_references
    else:
        core_references_fraction, noncore_references_fraction = 0.0, 0.0
        total_first_order_references = 0

    return core_references_fraction, noncore_references_fraction, total_first_order_references
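# Editor's sketch of a call, with invented recids, assuming 123456 is in
# core_recids, 654321 is in noncore_recids, and get_recid_from_ref resolves
# a JSON reference like {'$ref': '.../api/literature/123456'} to 123456:
#   refs = [{'record': {'$ref': 'http://localhost:5000/api/literature/123456'}},
#           {'record': {'$ref': 'http://localhost:5000/api/literature/654321'}}]
#   get_first_order_core_noncore_reference_fractions(refs)  # -> (0.5, 0.5, 2)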


def get_second_order_core_noncore_reference_fractions(references):
    # Walk the references of each referenced record (the second-order
    # references) and compute the Core/Non-Core fractions over all of them.
    num_core_refs = 0
    num_noncore_refs = 0
    total_second_order_references = 0
    first_order_recids = get_references_recids(references)
    missing_recids = set()
    if first_order_recids:
        for f_recid in first_order_recids:
            if f_recid not in missing_recids:
                try:
                    second_order_references = get_db_record('lit', f_recid).get('references')
                except RecordGetterError:
                    missing_recids.add(f_recid)
                    continue
                if second_order_references:
                    total_second_order_references += len(second_order_references)
                    second_order_recids = get_references_recids(second_order_references)
                    for s_recid in second_order_recids:
                        if s_recid in core_recids:
                            num_core_refs += 1
                        elif s_recid in noncore_recids:
                            num_noncore_refs += 1
    if total_second_order_references > 0:
        core_references_fraction = num_core_refs / total_second_order_references
        noncore_references_fraction = num_noncore_refs / total_second_order_references
    else:
        core_references_fraction, noncore_references_fraction = 0.0, 0.0

    return core_references_fraction, noncore_references_fraction, total_second_order_references
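# Editor's note: this fetches each referenced record from the local database
# and scans its own references, so the fractions are computed over all
# second-order references; records that cannot be fetched (RecordGetterError)
# are remembered in missing_recids and skipped on repeat occurrences.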


def get_references_recids(references):
    # Extract the recids of all references that actually link to a record.
    recids = None
    if references:
        recids = [get_recid_from_ref(reference.get('record'))
                  for reference in references if reference.get('record')]
    return recids
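# Editor's sketch (invented input): references without a 'record' link are
# dropped, so get_references_recids([{'record': {'$ref': '.../literature/42'}}, {}])
# returns [42].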

-base_query = db.session.query(RecordMetadata).with_entities(RecordMetadata.json['titles'][0]['title'], RecordMetadata.json['abstracts'][0]['value'])
+base_query = db.session.query(RecordMetadata).with_entities(RecordMetadata.json['titles'][0]['title'], RecordMetadata.json['abstracts'][0]['value'], RecordMetadata.json['references'])
filter_by_date = RecordMetadata.created >= STARTING_DATE
has_title_and_abstract = and_(type_coerce(RecordMetadata.json, JSONB).has_key('titles'), type_coerce(RecordMetadata.json, JSONB).has_key('abstracts'))
filter_deleted_records = or_(not_(type_coerce(RecordMetadata.json, JSONB).has_key('deleted')), not_(RecordMetadata.json['deleted'] == cast(True, JSONB)))
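# Editor's note: these filters keep records created on or after STARTING_DATE
# that have both a title and an abstract and are not marked as deleted;
# combined with only_literature_collection below, the queries are restricted
# to Literature records.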
@@ -54,14 +123,30 @@
noncore_query_results = base_query.filter(filter_by_date, only_noncore_records, has_title_and_abstract, filter_deleted_records, only_literature_collection)

with open('inspire_core_records.jsonl', 'w') as fd:
-    for title, abstract in core_query_results:
+    for title, abstract, references in core_query_results:
        core_references_fraction_first_order, noncore_references_fraction_first_order, total_first_order_references = get_first_order_core_noncore_reference_fractions(references)
        core_references_fraction_second_order, noncore_references_fraction_second_order, total_second_order_references = get_second_order_core_noncore_reference_fractions(references)
        fd.write(json.dumps({
            'title': title,
            'abstract': abstract,
            'core_references_fraction_first_order': core_references_fraction_first_order,
            'noncore_references_fraction_first_order': noncore_references_fraction_first_order,
            'core_references_fraction_second_order': core_references_fraction_second_order,
            'noncore_references_fraction_second_order': noncore_references_fraction_second_order,
            'total_first_order_references': total_first_order_references,
            'total_second_order_references': total_second_order_references,
        }) + '\n')
with open('inspire_noncore_records.jsonl', 'w') as fd:
-    for title, abstract in noncore_query_results:
+    for title, abstract, references in noncore_query_results:
        core_references_fraction_first_order, noncore_references_fraction_first_order, total_first_order_references = get_first_order_core_noncore_reference_fractions(references)
        core_references_fraction_second_order, noncore_references_fraction_second_order, total_second_order_references = get_second_order_core_noncore_reference_fractions(references)
        fd.write(json.dumps({
            'title': title,
            'abstract': abstract,
            'core_references_fraction_first_order': core_references_fraction_first_order,
            'noncore_references_fraction_first_order': noncore_references_fraction_first_order,
            'core_references_fraction_second_order': core_references_fraction_second_order,
            'noncore_references_fraction_second_order': noncore_references_fraction_second_order,
            'total_first_order_references': total_first_order_references,
            'total_second_order_references': total_second_order_references,
        }) + '\n')
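Each line of the resulting *inspire_core_records.jsonl* (and of its Non-Core counterpart) is then a JSON object of the following shape, with invented values for illustration:

{"title": "...", "abstract": "...", "core_references_fraction_first_order": 0.4, "noncore_references_fraction_first_order": 0.2, "core_references_fraction_second_order": 0.35, "noncore_references_fraction_second_order": 0.25, "total_first_order_references": 25, "total_second_order_references": 410}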
generate_training_data_scripts/get_core_and_noncore_arXiv_identifiers.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of INSPIRE.
-# Copyright (C) 2014-2018 CERN.
+# Copyright (C) 2014-2019 CERN.
#
# INSPIRE is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
@@ -25,13 +25,16 @@
Please run the code in this snippet from within the inspirehep shell.
"""

from __future__ import absolute_import, division, print_function

from invenio_search import current_search_client as es
from elasticsearch.helpers import scan
import numpy as np


core = []
non_core = []


for hit in scan(es, query={"query": {"exists": {"field": "arxiv_eprints"}}, "_source": ["core", "arxiv_eprints"]},
                index='records-hep', doc_type='hep'):
    source = hit['_source']
@@ -41,7 +44,7 @@
    else:
        non_core.append(arxiv_eprint)

-with open('inspire_core_list.txt', 'w') as fd:
+with open('inspire_core_arxiv_ids.txt', 'w') as fd:
    fd.writelines("{}\n".format(arxiv_id) for arxiv_id in core)
-with open('inspire_noncore_list.txt', 'w') as fd:
-    fd.writelines("{}\n".format(arxiv_id) for arxiv_id in non_core)
+with open('inspire_noncore_arxiv_ids.txt', 'w') as fd:
+    fd.writelines("{}\n".format(arxiv_id) for arxiv_id in non_core)
45 changes: 45 additions & 0 deletions generate_training_data_scripts/get_core_and_non_core_recids.py
@@ -0,0 +1,45 @@
# -*- coding: utf-8 -*-
#
# This file is part of INSPIRE.
# Copyright (C) 2014-2019 CERN.
#
# INSPIRE is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# INSPIRE is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with INSPIRE. If not, see <http://www.gnu.org/licenses/>.
#
# In applying this license, CERN does not waive the privileges and immunities
# granted to it by virtue of its status as an Intergovernmental Organization
# or submit itself to any jurisdiction.

from __future__ import absolute_import, division, print_function

from invenio_search import current_search_client as es
from elasticsearch.helpers import scan


core = []
non_core = []

for hit in scan(es, query={"query": {"exists": {"field": "control_number"}}, "_source": ["core", "control_number"]},
                index='records-hep', doc_type='hep'):
    source = hit['_source']
    control_number = source['control_number']
    if source.get('core'):
        core.append(control_number)
    else:
        non_core.append(control_number)

with open('inspire_core_recids.txt', 'w') as fd:
    fd.writelines("{}\n".format(recid) for recid in core)
with open('inspire_noncore_recids.txt', 'w') as fd:
    fd.writelines("{}\n".format(recid) for recid in non_core)
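# Editor's note: both output files contain one record id per line, which is
# exactly what generate_core_and_noncore_data.py reads back with
# int(recid.strip()).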
