CDS: harvest directly from OAI-PMH #198
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,13 +10,7 @@ | |
"""Spider for the CERN Document Server OAI-PMH interface""" | ||
|
||
import logging | ||
from scrapy import Request | ||
from flask.app import Flask | ||
from harvestingkit.inspire_cds_package.from_cds import CDS2Inspire | ||
from harvestingkit.bibrecord import ( | ||
create_record as create_bibrec, | ||
record_xml_output, | ||
) | ||
from dojson.contrib.marc21.utils import create_record | ||
from inspire_dojson.hep import hep | ||
|
||
|
@@ -34,10 +28,8 @@ class CDSSpider(OAIPMHSpider): | |
$ scrapy crawl CDS \\ | ||
-a "oai_set=forINSPIRE" -a "from_date=2017-10-10" | ||
|
||
It uses `HarvestingKit <https://pypi.python.org/pypi/HarvestingKit>`_ to | ||
translate from CDS's MARCXML into INSPIRE Legacy's MARCXML flavor. It then | ||
employs `inspire-dojson <https://pypi.python.org/pypi/inspire-dojson>`_ to | ||
transform the legacy INSPIRE MARCXML into the new INSPIRE Schema. | ||
It uses `inspire-dojson <https://pypi.python.org/pypi/inspire-dojson>`_ to | ||
translate from CDS's MARCXML into the new INSPIRE Schema. | ||
""" | ||
|
||
name = 'CDS' | ||
|
@@ -57,23 +49,13 @@ def __init__(self, | |
|
||
def parse_record(self, selector): | ||
selector.remove_namespaces() | ||
try: | ||
cds_bibrec, ok, errs = create_bibrec(selector.xpath('.//record').extract()[0]) | ||
if not ok: | ||
raise RuntimeError("Cannot parse record %s: %s", selector, errs) | ||
self.logger.info("Here's the record: %s" % cds_bibrec) | ||
inspire_bibrec = CDS2Inspire(cds_bibrec).get_record() | ||
marcxml_record = record_xml_output(inspire_bibrec) | ||
record = create_record(marcxml_record) | ||
app = Flask('hepcrawl') | ||
app.config.update( | ||
self.settings.getdict('MARC_TO_HEP_SETTINGS', {}) | ||
) | ||
with app.app_context(): | ||
json_record = hep.do(record) | ||
base_uri = self.settings['SCHEMA_BASE_URI'] | ||
json_record['$schema'] = base_uri + 'hep.json' | ||
return ParsedItem(record=json_record, record_format='hep') | ||
except Exception: | ||
logger.exception("Error when parsing record") | ||
return None | ||
record = create_record(selector.xpath('.//record').extract()[0]) | ||
|
||
app = Flask('hepcrawl') | ||
app.config.update( | ||
self.settings.getdict('MARC_TO_HEP_SETTINGS', {}) | ||
) | ||
with app.app_context(): | ||
json_record = hep.do(record) | ||
Review comment: this shouldn't use
Reply: I think I also added this in the last PR about arXiv, as it was failing the tests, but 👍 if we don't need it. |
||
base_uri = self.settings['SCHEMA_BASE_URI'] | ||
json_record['$schema'] = base_uri + 'hep.json' | ||
Review comment: os.path.join |
||
return ParsedItem(record=json_record, record_format='hep') |
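The `os.path.join` suggestion for building the `$schema` value can be sketched as below. This is only an illustration: the `schema_uri` helper and the example base URI are hypothetical and not part of the PR. It uses `posixpath.join`, which is what `os.path.join` resolves to on POSIX systems and which always joins with `/`. Unlike plain string concatenation, the result is the same whether or not the base URI ends with a slash.

```python
import posixpath


def schema_uri(base_uri, schema_name='hep.json'):
    """Hypothetical helper: join a base URI and a schema file name.

    posixpath.join inserts exactly one '/' between the two parts when
    the base does not already end with one.
    """
    return posixpath.join(base_uri, schema_name)


# Example base URIs are assumptions, not values taken from the PR.
with_slash = schema_uri('http://localhost/schemas/records/')
without_slash = schema_uri('http://localhost/schemas/records')

# Plain concatenation depends on the trailing slash being present.
concatenated = 'http://localhost/schemas/records' + 'hep.json'
```

With plain concatenation, a `SCHEMA_BASE_URI` setting without a trailing slash would silently produce a malformed URI, which is presumably what the suggestion is meant to guard against.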
Review comment: these three lines should probably be part of the OAIPMHSpider.
Reply: Not sure if in general one wants to always remove namespaces.
Reply: OK, the first two lines then :)
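One way to read the thread above is that the record extraction moves into the base spider, while namespace removal stays optional. The sketch below is a guess at that shape, not the actual hepcrawl code: `BaseOAISpider`, `extract_record_xml`, the `remove_namespaces` class attribute, `CDSLikeSpider`, and `FakeSelector` are all hypothetical names. The thread does not say exactly which lines would move, so treat the split as an assumption.

```python
class BaseOAISpider(object):
    """Hypothetical stand-in for OAIPMHSpider; not the real hepcrawl class."""

    # Assumption: subclasses set this to False to keep namespaces.
    remove_namespaces = True

    def extract_record_xml(self, selector):
        """Shared step: optionally strip namespaces, return the record XML."""
        if self.remove_namespaces:
            selector.remove_namespaces()
        return selector.xpath('.//record').extract()[0]


class CDSLikeSpider(BaseOAISpider):
    """Hypothetical subclass keeping only the format-specific part."""

    def parse_record(self, selector):
        # The real spider would convert the XML to a HEP record here.
        return {'raw': self.extract_record_xml(selector)}


class FakeSelector(object):
    """Test double mimicking the selector calls used in the diff."""

    def __init__(self, xml):
        self.xml = xml
        self.namespaces_removed = False

    def remove_namespaces(self):
        self.namespaces_removed = True

    def xpath(self, query):
        return self

    def extract(self):
        return [self.xml]


selector = FakeSelector('<record/>')
item = CDSLikeSpider().parse_record(selector)
```

A spider that needs namespaces preserved would set `remove_namespaces = False` on its class, which addresses the concern about always removing them.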