Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hepcrawl: add crawler for OSTI #276

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

tsgit
Copy link
Contributor

@tsgit tsgit commented Sep 16, 2019

* use API at OSTI to harvest records associated with SLAC

Signed-off-by: Thorsten Schwander [email protected]

Description

This adds a LastRunSpider to crawl OSTI for records with SLAC association. The purpose is to satisfy an institutional mandate of having all SLAC HEP research represented in Inspire. Not all SLAC research output is on arXiv or other customarily harvested channels. OSTI is an additional channel to check.

Related Issue

Motivation and Context

Checklist:

  • [x ] I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • [ x] I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • [x ] I have added tests to cover my changes.
  • [ x] All new and existing tests passed.

Copy link
Contributor

@michamos michamos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't have time to look at everything before going home, I'll continue tomorrow, but here are already a few comments.

hepcrawl/parsers/osti.py Outdated Show resolved Hide resolved
"""
return [t[4:] if t.startswith(u'The ') else t for t in
[c.replace(u'Collaboration', '').strip() for c in
self.record.get(u'contributing_org', '').split(u';')]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stripping of the and collaboration is already done in the builder (which calls https://github.com/inspirehep/inspire-schemas/blob/965302b1062f1fc10a046a1ab99fcd08084b0439/inspire_schemas/utils.py#L719 ), so no need to do it here. Splitting on ; could be added there too if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. _RE_AND could be augmented to split on ; in addition.
however on things like
'LUX Collaboration; Nuclear Science and Security Consortium'
the string should not be split on and

it's the only outlier I see in 3k records I just checked, though.

another problem I see with splitting at ; in general is that it might mess up HTML escapes like & Is there a step to clean those before splitting? are you going to augment schema utils?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is definitely mixed content with ; and and as separator
The DES Collaboration; The LIGO Scientific Collaboration and the Virgo Collaboration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also repetition after splitting and removing things, e.g. in

LIGO Scientific Collaboration; Virgo Collaboration; Fermi GBM; INTEGRAL; IceCube Collaboration; AstroSat Cadmium Zinc Telluride Imager Team; IPN Collaboration; The Insight-Hxmt Collaboration; ANTARES Collaboration; The Swift Collaboration; AGILE Team; The 1M2H Team; The Dark Energy Camera GW-EM Collaboration; the DES Collaboration; The DLT40 Collaboration; LIGO Scientific Collaboration and Virgo Collaboration; The Insight-HXMT Collaboration; The Dark Energy Camera GW-EM Collaboration and the DES Collaboration

are the schema/utils deduping the list ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another problem I see with splitting at ; in general is that it might mess up HTML escapes like & Is there a step to clean those before splitting? are you going to augment schema utils?

I don't think dealing with various text encodings and escaping schemes should be part of the schema utils honestly. The crawler should know what format it expects and convert to unescaped unicode.

are the schema/utils deduping the list ?

Not currently, but it could be added (there's utils for it in inspire_utils.dedupers).

hepcrawl/parsers/osti.py Outdated Show resolved Hide resolved
hepcrawl/parsers/osti.py Outdated Show resolved Hide resolved
hepcrawl/parsers/osti.py Outdated Show resolved Hide resolved
hepcrawl/parsers/osti.py Outdated Show resolved Hide resolved
hepcrawl/parsers/osti.py Show resolved Hide resolved

author_re = re.compile(r"""
^(?:(?P<surname>[\w.']+(?:\s*[\w.'-]+)*)(?:\s*,\s*
(?P<given_names>\w+(\s*[\w.'-]+)*))?\s*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you actually need to parse the name here. The builder already performs name normalization, so whatever name you throw at it should work. If it doesn't work correctly, it would probably be worthwile to improve name normalization in https://github.com/inspirehep/inspire-utils/blob/master/inspire_utils/name.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the incoming data is unreliable. there are unmatched ] or missing [
I could leave the name part alone and separate out [affiliations] and (ORCID:1234567890123456)
either way, a firstname(s) or initial(s), lastname(s) split on a comma appears to be the best to go by

hepcrawl/parsers/osti.py Show resolved Hide resolved
@tsgit
Copy link
Contributor Author

tsgit commented Sep 16, 2019

very good comments @michamos thanks

@tsgit tsgit force-pushed the osti4 branch 2 times, most recently from 1a8705f to a7bbd97 Compare September 17, 2019 05:00
Copy link
Contributor

@michamos michamos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments. I think it's important to add tests for schema validation, because some of the things you're doing seem not to be valid.

tests/unit/test_osti_single.py Show resolved Hide resolved
hepcrawl/spiders/osti_spider.py Show resolved Hide resolved
Returns:
str:
"""
return self.__product_types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one _ is probably enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the __ is intentional here, see e.g. https://www.python-course.eu/python3_properties.php

because I want to enforce checks on setting

Returns:
str:
"""
return self.__journal_types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

hepcrawl/spiders/osti_spider.py Show resolved Hide resolved
tests/unit/test_osti_single.py Show resolved Hide resolved
hepcrawl/parsers/osti.py Outdated Show resolved Hide resolved
"""
return [t[4:] if t.startswith(u'The ') else t for t in
[c.replace(u'Collaboration', '').strip() for c in
self.record.get(u'contributing_org', '').split(u';')]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another problem I see with splitting at ; in general is that it might mess up HTML escapes like &amp; Is there a step to clean those before splitting? are you going to augment schema utils?

I don't think dealing with various text encodings and escaping schemes should be part of the schema utils honestly. The crawler should know what format it expects and convert to unescaped unicode.

are the schema/utils deduping the list ?

Not currently, but it could be added (there's utils for it in inspire_utils.dedupers).

@tsgit
Copy link
Contributor Author

tsgit commented Sep 18, 2019

right, I agree that schema_utils shouldn't deal with encoding issues -- which means there will be some sanitizing of random input in the crawler. It's not like the remote end serves stuff in a consistent encoding, it's random crap in the remote metadata -- so the crawler should understand the quirks of the source.

on the other hand you advocate for collaboration splitting and normalization in the utils, but then there is no deduping !?
if the input data has Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations then the collaborations end up replicated

So I think LiteratureBuilder should ensure deduping of lists like collection and collaborations among others. That's beyond this PR, though.

I don't feel strongly about __method vs. _method, but I did actually follow advice from some python coding resources online about encapsulation. The one I linked above isn't the one I used, but it's comparable, and I think it makes a decent argument. It'll always be a problem when encapsulation is enforced by naming convention and not by code, though.

    * use API at OSTI to harvest records associated with SLAC

Signed-off-by: Thorsten Schwander <[email protected]>
@tsgit tsgit force-pushed the osti4 branch 3 times, most recently from 5b838e3 to 80d34e4 Compare October 22, 2019 23:02
Signed-off-by: Thorsten Schwander <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants