hepcrawl: add crawler for OSTI #276
base: master
didn't have time to look at everything before going home, I'll continue tomorrow, but here are already a few comments.
hepcrawl/parsers/osti.py
Outdated
    """
    return [t[4:] if t.startswith(u'The ') else t for t in
            [c.replace(u'Collaboration', '').strip() for c in
             self.record.get(u'contributing_org', '').split(u';')]]
The stripping of "the" and "collaboration" is already done in the builder (which calls https://github.com/inspirehep/inspire-schemas/blob/965302b1062f1fc10a046a1ab99fcd08084b0439/inspire_schemas/utils.py#L719 ), so no need to do it here. Splitting on ; could be added there too if needed.
agreed. _RE_AND could be augmented to split on ; in addition.
however on things like
'LUX Collaboration; Nuclear Science and Security Consortium'
the string should not be split on "and".
it's the only outlier I see in 3k records I just checked, though.
another problem I see with splitting at ; in general is that it might mess up HTML escapes like &amp;
Is there a step to clean those before splitting? are you going to augment schema utils?
There is definitely mixed content with ; and "and" as separator:
The DES Collaboration; The LIGO Scientific Collaboration and the Virgo Collaboration
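To make the mixed-separator case concrete, here is a minimal sketch of unescaping HTML entities first and then splitting on either ; or a standalone "and". The helper name split_collaborations and the separator regex are hypothetical; they are not the _RE_AND pattern from inspire-schemas, and the sketch assumes Python 3 (html.unescape).

```python
import html
import re

# Hypothetical pattern for illustration only: split on ';' or on the word
# 'and' surrounded by whitespace. This is not the _RE_AND from inspire-schemas.
_SEPARATORS = re.compile(r"\s*;\s*|\s+and\s+", re.IGNORECASE)


def split_collaborations(raw):
    """Unescape HTML entities, then split on ';' or a standalone 'and'."""
    # Unescaping first keeps the ';' of an entity such as '&amp;' from
    # being treated as a separator.
    text = html.unescape(raw)
    return [part.strip() for part in _SEPARATORS.split(text) if part.strip()]


print(split_collaborations(
    "The DES Collaboration; The LIGO Scientific Collaboration "
    "and the Virgo Collaboration"
))
```

A blanket split on "and" like this would still break 'Nuclear Science and Security Consortium' into two pieces, so the outlier mentioned above would need separate handling.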
There is also repetition after splitting and removing things, e.g. in
LIGO Scientific Collaboration; Virgo Collaboration; Fermi GBM; INTEGRAL; IceCube Collaboration; AstroSat Cadmium Zinc Telluride Imager Team; IPN Collaboration; The Insight-Hxmt Collaboration; ANTARES Collaboration; The Swift Collaboration; AGILE Team; The 1M2H Team; The Dark Energy Camera GW-EM Collaboration; the DES Collaboration; The DLT40 Collaboration; LIGO Scientific Collaboration and Virgo Collaboration; The Insight-HXMT Collaboration; The Dark Energy Camera GW-EM Collaboration and the DES Collaboration
are the schema/utils deduping the list ?
> another problem I see with splitting at ; in general is that it might mess up HTML escapes like &amp;
> Is there a step to clean those before splitting? are you going to augment schema utils?
I don't think dealing with various text encodings and escaping schemes should be part of the schema utils honestly. The crawler should know what format it expects and convert to unescaped unicode.
> are the schema/utils deduping the list ?
Not currently, but it could be added (there's utils for it in inspire_utils.dedupers).
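As an illustration of the deduping being discussed, the following sketch removes repeated names while keeping the first occurrence and its order, comparing case-insensitively so that spellings such as "Insight-Hxmt" and "Insight-HXMT" collapse. The function name is hypothetical and the sketch deliberately does not use inspire_utils.dedupers, whose API is not shown in this thread; it assumes Python 3 (str.casefold).

```python
def dedupe_preserving_order(names):
    """Drop case-insensitive repeats, keeping the first spelling seen."""
    seen = set()
    unique = []
    for name in names:
        key = name.casefold()  # case-insensitive comparison key
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


collaborations = [
    "LIGO Scientific", "Virgo", "Insight-Hxmt",
    "LIGO Scientific", "Virgo", "Insight-HXMT",
]
print(dedupe_preserving_order(collaborations))
```

Whether this belongs in the crawler or in the builder is the open question in the thread; the sketch only shows the operation itself.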
author_re = re.compile(r"""
    ^(?:(?P<surname>[\w.']+(?:\s*[\w.'-]+)*)(?:\s*,\s*
    (?P<given_names>\w+(\s*[\w.'-]+)*))?\s*
I'm not sure you actually need to parse the name here. The builder already performs name normalization, so whatever name you throw at it should work. If it doesn't work correctly, it would probably be worthwhile to improve name normalization in https://github.com/inspirehep/inspire-utils/blob/master/inspire_utils/name.py.
the incoming data is unreliable. there are unmatched ] or missing [
I could leave the name part alone and separate out [affiliations] and (ORCID:1234567890123456)
either way, a firstname(s) or initial(s), lastname(s) split on a comma appears to be the best to go by
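A rough sketch of the "leave the name alone, separate out the rest" option could look like the following. The function name, the returned keys, and both regexes are assumptions made for illustration; this is not the parser in the PR, and the sample strings are invented to match the formats described above.

```python
import re

# Assumed formats, based on the examples in this thread:
#   (ORCID:1234567890123456)  16 characters, no hyphens, last may be X
#   [affiliation]             text inside a matched pair of brackets
_ORCID = re.compile(r"\(ORCID:\s*([0-9]{15}[0-9X])\)")
_AFFILIATION = re.compile(r"\[([^\[\]]*)\]")


def split_author(raw):
    """Separate ORCID and bracketed affiliations from the name part."""
    orcid_match = _ORCID.search(raw)
    orcid = orcid_match.group(1) if orcid_match else None
    rest = _ORCID.sub("", raw)

    affiliations = [a.strip() for a in _AFFILIATION.findall(rest) if a.strip()]
    rest = _AFFILIATION.sub("", rest)

    # Drop stray brackets left over from unmatched '[' or ']'. Text that was
    # meant to sit inside a broken pair cannot be recovered here and stays
    # in the name part.
    name = " ".join(re.sub(r"[\[\]]", "", rest).split())
    return {"full_name": name, "affiliations": affiliations, "orcid": orcid}


print(split_author("Doe, Jane [SLAC] (ORCID:0000000212345678)"))
print(split_author("Doe, Jane ] [SLAC]"))
```

The name part is passed through untouched apart from whitespace cleanup, which leaves the actual name normalization to the builder as suggested above.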
very good comments @michamos thanks
1a8705f to a7bbd97
Some more comments. I think it's important to add tests for schema validation, because some of the things you're doing seem not to be valid.
    Returns:
        str:
    """
    return self.__product_types
one _ is probably enough
the __ is intentional here, see e.g. https://www.python-course.eu/python3_properties.php because I want to enforce checks on setting
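For readers following this exchange, here is a minimal sketch of the pattern under discussion: a property whose setter validates input and stores the value in a double-underscore attribute. The class name OSTIQuery and the allowed values are hypothetical and chosen only for illustration; they are not taken from the PR.

```python
class OSTIQuery(object):
    """Hypothetical class, used only to illustrate the property pattern."""

    # Assumed set of valid values, for illustration only.
    _ALLOWED = frozenset(["Journal Article", "Technical Report"])

    def __init__(self, product_types=("Journal Article",)):
        # Assigning via the property runs the setter's validation.
        self.product_types = product_types

    @property
    def product_types(self):
        return self.__product_types

    @product_types.setter
    def product_types(self, values):
        unknown = set(values) - self._ALLOWED
        if unknown:
            raise ValueError("unknown product types: %s" % sorted(unknown))
        # Inside the class body, '__product_types' is name-mangled to
        # '_OSTIQuery__product_types'.
        self.__product_types = list(values)


query = OSTIQuery()
print(query.product_types)
print(query._OSTIQuery__product_types)
```

The check lives in the setter, so it runs the same way with a backing attribute named _product_types or __product_types; the double underscore only adds name mangling, and the mangled attribute remains reachable from outside the class.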
    Returns:
        str:
    """
    return self.__journal_types
same
hepcrawl/parsers/osti.py
Outdated
    """
    return [t[4:] if t.startswith(u'The ') else t for t in
            [c.replace(u'Collaboration', '').strip() for c in
             self.record.get(u'contributing_org', '').split(u';')]]
> > another problem I see with splitting at ; in general is that it might mess up HTML escapes like &amp;
> > Is there a step to clean those before splitting? are you going to augment schema utils?
> I don't think dealing with various text encodings and escaping schemes should be part of the schema utils honestly. The crawler should know what format it expects and convert to unescaped unicode.
> > are the schema/utils deduping the list ?
> Not currently, but it could be added (there's utils for it in inspire_utils.dedupers).
right, I agree that schema_utils shouldn't deal with encoding issues -- which means there will be some sanitizing of random input in the crawler. It's not like the remote end serves stuff in a consistent encoding, it's random crap in the remote metadata -- so the crawler should understand the quirks of the source. on the other hand you advocate for collaboration splitting and normalization in the utils, but then there is no deduping !? So I think LiteratureBuilder should ensure deduping of lists like I don't feel strongly about
* use API at OSTI to harvest records associated with SLAC
Signed-off-by: Thorsten Schwander <[email protected]>
5b838e3 to 80d34e4
Description
This adds a LastRunSpider to crawl OSTI for records with SLAC association. The purpose is to satisfy an institutional mandate of having all SLAC HEP research represented in Inspire. Not all SLAC research output is on arXiv or other customarily harvested channels. OSTI is an additional channel to check.