parsers: create an NLM parser #209
Signed-off-by: Szymon Łopaciuk <[email protected]>
Args:
    nlm_records (Union[string, scrapy.selector.Selector]): records
    source (Optional[string]): source passed to `__init__`
Please document the return value.
hepcrawl/parsers/nlm.py
Outdated
day = node.xpath('./Day/text()').extract_first()
month = node.xpath('./Month/text()').extract_first()
year = node.xpath('./Year/text()').extract_first()
return PartialDate(
It's better to use PartialDate.from_parts, which handles empty values and non-numeric months just fine:
In [1]: from inspire_utils.date import PartialDate
In [2]: PartialDate.from_parts(2017, 'Jan')
Out[2]: PartialDate(year=2017, month=1, day=None)
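To illustrate the point about empty values and month names, here is a minimal stdlib-only sketch. It is not the `inspire_utils` implementation: `parse_date_node`, `to_month_number` and `MONTHS` are hypothetical names, `xml.etree.ElementTree` stands in for the scrapy selector, and the sample `<PubDate>` element is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table: three-letter English month abbreviations -> number.
MONTHS = {
    name: number
    for number, name in enumerate(
        ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
         'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
        start=1,
    )
}


def to_month_number(value):
    """Return the month as an int, accepting '3', '03', 'Mar' or 'March'."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return int(value)
    # Unknown names fall through to None instead of raising.
    return MONTHS.get(value[:3].lower())


def parse_date_node(node):
    """Return a (year, month, day) tuple; missing or empty parts become None."""
    year = node.findtext('Year')
    month = node.findtext('Month')
    day = node.findtext('Day')
    return (
        int(year) if year else None,
        to_month_number(month),
        int(day) if day else None,
    )


node = ET.fromstring('<PubDate><Year>2017</Year><Month>Jan</Month></PubDate>')
print(parse_date_node(node))  # (2017, 1, None)
```

With a helper that already tolerates missing parts, the parser does not need its own `None` checks or month conversion before building the date.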
hepcrawl/parsers/nlm.py
Outdated
pub_type = self.root.xpath('./PublicationType/text()').extract_first()

if 'Conference' in pub_type or pub_type == 'Congresses':
    return 'proceedings'
I think this should be conference paper rather than proceedings, but I would need to look at some examples.
I got an example IOP update from @david-caro with a few records, but unfortunately none of them actually have <PublicationType> set. Maybe there will be more records when we get access, or maybe IOP don't use the field at all... Meanwhile, I found a few examples at NLM, so I think that means you are right?
Looks like it. But I would not be surprised if IOP put its own values there anyway, that have nothing to do with those in the spec.
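Since some records carry no `<PublicationType>` at all, the check in the diff would have to tolerate a missing value: scrapy's `extract_first()` returns `None` when nothing matches, and `'Conference' in None` raises a `TypeError`. A minimal None-safe sketch follows, assuming the conference paper mapping discussed above turns out to be right. The function name `document_type_from_publication_type` and its `default` parameter are hypothetical and not taken from the PR.

```python
def document_type_from_publication_type(pub_type, default=None):
    """Map an NLM PublicationType string to a HEP document type.

    Returns `default` when the element is missing, empty or unrecognised.
    """
    if not pub_type:
        # The element was absent or empty in the record.
        return default
    if 'Conference' in pub_type or pub_type == 'Congresses':
        # Assumption from the review thread: 'conference paper',
        # not 'proceedings'.
        return 'conference paper'
    return default


print(document_type_from_publication_type(None))
print(document_type_from_publication_type('Congresses'))
```

If IOP does put its own values in the field, unknown strings simply fall back to `default` rather than breaking the parse.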
authors = self.root.xpath('./AuthorList/Author')
authors_in_collaborations = self.root.xpath(
    './GroupList/Group'
    '[GroupName/text()=../../AuthorList/Author/CollectiveName/text()]'
what's the purpose of this?
<CollectiveName> inside the <Author> acts as sort of a pointer to the <Group> of the same name, where the actual people of the group are listed, like here: https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Can_Collaborator_Names_be. So this gets all the people from groups referenced in <Authors>. Though maybe it is too strict: now that I think about it, I don't think there is a use case for an "unreferenced" group?
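The pointer-like relationship described here can be sketched with the standard library. This is only an illustration: `xml.etree.ElementTree` does not support the cross-referencing predicate used in the diff, so the match is done in Python instead, and the record, the `<IndividualName>` layout and the function name `members_of_referenced_groups` are simplified assumptions rather than code from the PR.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified record; real NLM records carry more fields.
RECORD = """
<Article>
  <AuthorList>
    <Author><FirstName>Ada</FirstName><LastName>Lovelace</LastName></Author>
    <Author><CollectiveName>Example Collaboration</CollectiveName></Author>
  </AuthorList>
  <GroupList>
    <Group>
      <GroupName>Example Collaboration</GroupName>
      <IndividualName><FirstName>Alan</FirstName><LastName>Turing</LastName></IndividualName>
    </Group>
    <Group>
      <GroupName>Unreferenced Group</GroupName>
      <IndividualName><FirstName>Grace</FirstName><LastName>Hopper</LastName></IndividualName>
    </Group>
  </GroupList>
</Article>
"""


def members_of_referenced_groups(root):
    """Return (group name, full name) pairs for groups named in <AuthorList>."""
    referenced = {
        name.text
        for name in root.findall('./AuthorList/Author/CollectiveName')
    }
    members = []
    for group in root.findall('./GroupList/Group'):
        group_name = group.findtext('GroupName')
        if group_name not in referenced:
            continue  # skip groups that no <Author> points at
        for person in group.findall('IndividualName'):
            full_name = '{} {}'.format(
                person.findtext('FirstName'), person.findtext('LastName')
            )
            members.append((group_name, full_name))
    return members


root = ET.fromstring(RECORD.strip())
print(members_of_referenced_groups(root))
# [('Example Collaboration', 'Alan Turing')]
```

Dropping the `referenced` check would also return the members of the unreferenced group, which is the less strict behaviour the comment is considering.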
    return self.root.xpath('./Journal/Volume/text()').extract_first()

@property
def material(self):
PublicationType may also contain Published Erratum, which maps to erratum. Don't know how this relates to the NLM field you are reading here. Maybe you should link to https://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Object_O, here or close to the NLM_OBJECT_TYPE_TO_HEP_MAP definition.
NLM_OBJECT_TYPE_TO_HEP_MAP = {
    'Erratum': 'erratum',
    'Reprint': 'reprint',
'Republished': 'reprint' also
Check in PublicationType for `Published Erratum` too, if `<Object>` check didn't return any matches. Add references to NLM docs. Signed-off-by: Szymon Łopaciuk <[email protected]>
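The fallback described in this commit message could look roughly like the sketch below. It assumes the object types and publication types have already been extracted into plain lists of strings; the function name `material_from_types` and the `default` parameter are hypothetical, and the map only contains the entries that appear in the diff and review above.

```python
# Entries taken from the diff and the review comment; the real map may hold more.
NLM_OBJECT_TYPE_TO_HEP_MAP = {
    'Erratum': 'erratum',
    'Reprint': 'reprint',
    'Republished': 'reprint',
}


def material_from_types(object_types, publication_types, default=None):
    """Pick a HEP material: <Object> types first, then PublicationType."""
    for object_type in object_types:
        material = NLM_OBJECT_TYPE_TO_HEP_MAP.get(object_type)
        if material:
            return material
    # Only consulted when no <Object> type matched.
    if 'Published Erratum' in publication_types:
        return 'erratum'
    return default


print(material_from_types(['Republished'], []))
print(material_from_types([], ['Published Erratum']))
```

Checking `<Object>` first keeps the more specific source authoritative, with `PublicationType` used only as a fallback.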
Description
This is an implementation of a parser for the NLM format. It takes an approach very similar to the JATS parser we already have, using LiteratureBuilder to build HEP records.
Related Issue
This is a step towards refreshing the IOP spider (#205)
Motivation and Context
IOP uses the NLM format to publish its citation records. The IOP spider currently relies on web scraping, but we will move to OAI-PMH together with this parser instead.