CDS: harvest directly from OAI-PMH #198

Closed
21 commits
fad1b50
create a OAI-PMH spider to use in CDS spider
kaplun Oct 10, 2017
33c3ae5
refactor, test contents
szymonlopaciuk Dec 7, 2017
db2953f
parse_record takes the selector
szymonlopaciuk Dec 8, 2017
4890aa1
spiders: OAI-PMH: continue where left off
szymonlopaciuk Dec 8, 2017
80efc44
use celerymonitor in CDS tests
szymonlopaciuk Dec 12, 2017
adb1906
CDS spider: drop HarvestingKit (#199)
szymonlopaciuk Dec 12, 2017
fff7c95
remove unused import
szymonlopaciuk Dec 12, 2017
4895b07
fix failure on lack of last runs file
szymonlopaciuk Dec 13, 2017
b7c3fc4
remove ignoring the exception on item validation
szymonlopaciuk Dec 13, 2017
bb5c834
style fixes
szymonlopaciuk Dec 13, 2017
acf9125
bump inspire-dojson~=57.0,>=57.1
szymonlopaciuk Dec 14, 2017
9a4f285
remove record_class field, as Record is default
szymonlopaciuk Dec 14, 2017
077c1f1
use os.path.join in cds_spider
szymonlopaciuk Dec 14, 2017
054aa0b
remove url from the last_run file hash
szymonlopaciuk Dec 14, 2017
b3159f7
remove granularity, default to YYYY-MM-DD for now
szymonlopaciuk Dec 14, 2017
10804f7
refactor tests
szymonlopaciuk Dec 14, 2017
5851258
stricter error catching when loading last_runs
szymonlopaciuk Dec 14, 2017
23c3d90
leave only a few test records, remove the rest
szymonlopaciuk Dec 14, 2017
332071f
tests: naming and don't load directly from file
szymonlopaciuk Dec 14, 2017
a96f3c4
make parse_record abstract
szymonlopaciuk Dec 14, 2017
6b7d886
spiders: move Stateful and OAI to common module
szymonlopaciuk Jan 16, 2018
42 changes: 12 additions & 30 deletions hepcrawl/spiders/cds_spider.py
@@ -10,13 +10,7 @@
 """Spider for the CERN Document Server OAI-PMH interface"""

 import logging
-from scrapy import Request
 from flask.app import Flask
-from harvestingkit.inspire_cds_package.from_cds import CDS2Inspire
-from harvestingkit.bibrecord import (
-    create_record as create_bibrec,
-    record_xml_output,
-)
 from dojson.contrib.marc21.utils import create_record
 from inspire_dojson.hep import hep

@@ -34,10 +28,8 @@ class CDSSpider(OAIPMHSpider):
         $ scrapy crawl CDS \\
             -a "oai_set=forINSPIRE" -a "from_date=2017-10-10"

-    It uses `HarvestingKit <https://pypi.python.org/pypi/HarvestingKit>`_ to
-    translate from CDS's MARCXML into INSPIRE Legacy's MARCXML flavor. It then
-    employs `inspire-dojson <https://pypi.python.org/pypi/inspire-dojson>`_ to
-    transform the legacy INSPIRE MARCXML into the new INSPIRE Schema.
+    It uses `inspire-dojson <https://pypi.python.org/pypi/inspire-dojson>`_ to
+    translate from CDS's MARCXML into the new INSPIRE Schema.
     """

     name = 'CDS'
@@ -57,23 +49,13 @@ def __init__(self,

     def parse_record(self, selector):
         selector.remove_namespaces()
Contributor

these three lines should probably be part of the OAIPMHSpider.

Contributor Author

Not sure if in general one wants to always remove namespaces.

Contributor

OK, the first two lines then :)

-        try:
-            cds_bibrec, ok, errs = create_bibrec(selector.xpath('.//record').extract()[0])
-            if not ok:
-                raise RuntimeError("Cannot parse record %s: %s", selector, errs)
-            self.logger.info("Here's the record: %s" % cds_bibrec)
-            inspire_bibrec = CDS2Inspire(cds_bibrec).get_record()
-            marcxml_record = record_xml_output(inspire_bibrec)
-            record = create_record(marcxml_record)
-            app = Flask('hepcrawl')
-            app.config.update(
-                self.settings.getdict('MARC_TO_HEP_SETTINGS', {})
-            )
-            with app.app_context():
-                json_record = hep.do(record)
-            base_uri = self.settings['SCHEMA_BASE_URI']
-            json_record['$schema'] = base_uri + 'hep.json'
-            return ParsedItem(record=json_record, record_format='hep')
-        except Exception:
-            logger.exception("Error when parsing record")
-            return None
+        record = create_record(selector.xpath('.//record').extract()[0])
Contributor

extract()[0] is nearly equivalent to extract_first(): the former raises IndexError if the list is empty, while the latter returns None by default (the default value can be overridden).
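
To make the difference concrete, here is a minimal plain-Python sketch that mimics the two behaviours without importing Scrapy. `first_or_default` is a hypothetical stand-in for `extract_first(default=...)`, and the two lists stand in for what `extract()` might return; none of these names come from the PR.

```python
def first_or_default(items, default=None):
    """Hypothetical stand-in for SelectorList.extract_first(default=...)."""
    # Return the first element, or `default` when the list is empty.
    return items[0] if items else default


extracted = ['<record>...</record>']  # stand-in for a non-empty extract() result
empty = []                            # stand-in for an XPath that matched nothing

# Both styles agree when there is at least one match.
same_on_match = extracted[0] == first_or_default(extracted)

# On an empty result, indexing raises IndexError ...
try:
    empty[0]
    raised = False
except IndexError:
    raised = True

# ... while the extract_first() style returns None, or a caller-supplied default.
missing = first_or_default(empty)
fallback = first_or_default(empty, default='<record/>')
```

Which behaviour is preferable in parse_record probably depends on whether a response without a record element should fail loudly or be passed on as None.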

+        app = Flask('hepcrawl')
+        app.config.update(
+            self.settings.getdict('MARC_TO_HEP_SETTINGS', {})
+        )
+        with app.app_context():
+            json_record = hep.do(record)
Contributor

this shouldn't use hep.do but the new marcxml2record API. Otherwise, the CDS conversion is not triggered.

Contributor

I think I also added this in the last PR about arXiv, as it was failing the tests, but 👍 if we don't need it.

+        base_uri = self.settings['SCHEMA_BASE_URI']
+        json_record['$schema'] = base_uri + 'hep.json'
Contributor

os.path.join
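
A minimal sketch of why a join might be preferred here, assuming `SCHEMA_BASE_URI` is a plain string like the hypothetical values below (the real setting is not shown in this diff). Plain concatenation silently depends on the base ending with a slash, whereas a join inserts the separator only when it is missing. The sketch uses `posixpath.join` so the result is the same on every platform; on POSIX systems `os.path.join` is the same function.

```python
import posixpath

# Hypothetical base URIs; the actual SCHEMA_BASE_URI value is an assumption.
with_slash = 'http://localhost/schemas/records/'
without_slash = 'http://localhost/schemas/records'

# Plain concatenation only works when the base ends with '/'.
concat_ok = with_slash + 'hep.json'
concat_broken = without_slash + 'hep.json'

# A join yields the same result in both cases.
joined_a = posixpath.join(with_slash, 'hep.json')
joined_b = posixpath.join(without_slash, 'hep.json')
```

Note that on Windows `os.path.join` would insert a backslash, so for URLs a platform-independent join may be the safer choice.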

+        return ParsedItem(record=json_record, record_format='hep')