Add option to harvest article by DOI #277

Open
ErnestaP opened this issue Jan 11, 2024 · 1 comment

Comments

@ErnestaP

Add the option to harvest and re-harvest an article by DOI. In the old SCOAP3 we quite often face the situation where a specific article needs to be harvested or re-harvested, but this option is not supported. The Hindawi and APS APIs have an option to get an article by DOI. The situation is more complicated for publishers harvested from FTP/SFTP: their articles could be re-harvested by DOI, but not harvested by DOI in the first place, since they arrive in zip archives. When we unzip them, each article is saved in a separate file.

@ErnestaP (Author)

Details and examples:

APIs for harvesting by DOI — you only need to pass the DOI in the URL (a sketch of both calls follows):
Hindawi: https://www.hindawi.com/oai-pmh/oai.aspx?verb=getrecord&identifier=oai:hindawi.com:10.1155/2023/8127604&metadataprefix=oai_dc
APS: https://harvest.aps.org/v2/journals/articles/10.1103/PhysRevLett.131.231901
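
A minimal sketch of both calls using plain `requests`; the endpoint URLs come straight from the examples above, while the timeouts and the assumption that no extra authentication is needed (the APS harvest API may require credentials) should be adjusted to reality:

```python
import requests

HINDAWI_OAI_URL = "https://www.hindawi.com/oai-pmh/oai.aspx"
APS_ARTICLE_URL = "https://harvest.aps.org/v2/journals/articles/{doi}"

def harvest_hindawi_by_doi(doi: str) -> str:
    """Fetch a single record from Hindawi's OAI-PMH endpoint by DOI."""
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:hindawi.com:{doi}",
        "metadataPrefix": "oai_dc",
    }
    response = requests.get(HINDAWI_OAI_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.text  # OAI-PMH XML containing the one record

def harvest_aps_by_doi(doi: str) -> bytes:
    """Fetch a single article from the APS harvest API by DOI."""
    response = requests.get(APS_ARTICLE_URL.format(doi=doi), timeout=30)
    response.raise_for_status()
    return response.content

# e.g. harvest_aps_by_doi("10.1103/PhysRevLett.131.231901")
```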

Elsevier:
Elsevier is harvested from SFTP. The files it has there are zip and tar archives.
Harvesting by DOI: IF THE ARTICLE IS IN SFTP, we can read the contents of the zip/tar and take only the article we need (see the sketch below). This should not be difficult, since Elsevier has a mapping of where the articles are located inside the zip/tar files. IF THE ARTICLE IS NOT IN SFTP (older zips/tars are deleted), we can re-process the articles that we already have in our S3 but that, for some reason, are not in the repo. We need to verify whether the naming of the saved articles reflects/can reflect the DOI.
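
A sketch of the "take only the article we need" step, assuming the package's mapping file has already been parsed into a DOI → archive-path dict (the mapping file's exact name and format still need to be checked):

```python
import tarfile
from typing import Optional

def extract_article_from_tar(tar_path: str, doi_to_member: dict, doi: str) -> Optional[bytes]:
    """Pull a single article out of an Elsevier tar without unpacking it all.

    `doi_to_member` is the DOI -> path-inside-archive mapping parsed from
    the package's mapping file (exact file name/format to be verified).
    """
    member_path = doi_to_member.get(doi)
    if member_path is None:
        return None  # DOI is not in this package
    with tarfile.open(tar_path) as tar:
        fileobj = tar.extractfile(member_path)
        return fileobj.read() if fileobj is not None else None
```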

OUP:
Is harvested from SFTP. The files it has there are zip archives. They should be deleted from the SFTP after harvesting, because OUP uploads updates under the same names, which means they would overwrite the old files with the changes (new articles, updates of previous articles, etc.).
Harvesting by DOI: we should re-process the articles that we already have in our S3 but that, for some reason, are not in the repo, since the articles are deleted from the SFTP after the first harvest (see the sketch below). If the articles were never harvested before, they should still be on the SFTP; if they are not there, ask OUP to upload them.
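
A sketch of how the re-processing lookup could start; it only works if the verification above confirms that saved names contain the DOI, and the bucket, prefix, and the `/` → `_` sanitization are guesses:

```python
import boto3

def find_s3_keys_for_doi(bucket: str, prefix: str, doi: str) -> list:
    """List S3 keys whose names contain the (possibly sanitized) DOI."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    sanitized = doi.replace("/", "_")  # guess at how DOIs map to key names
    matches = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if doi in obj["Key"] or sanitized in obj["Key"]:
                matches.append(obj["Key"])
    return matches
```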

IOP:
Is harvested from SFTP. The files it has there are zip archives.
Harvesting by DOI: IF THE ARTICLE IS IN SFTP — like Elsevier, IOP also has all the file locations written in a mapping, only this time the mapping is a txt file. We can read the mapping and download only the articles we need (see the sketch below). IF THE ARTICLE IS NOT IN SFTP (older zips are deleted), we can re-process the articles that we already have in our S3. We need to verify whether the naming of the saved articles reflects/can reflect the DOI.
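
A sketch of reading the txt mapping, assuming (this needs verifying against the real file) that each line pairs a DOI with a path inside the zip, separated by whitespace:

```python
def paths_for_doi(mapping_text: str, doi: str) -> list:
    """Return the archive paths listed for `doi` in IOP's txt mapping.

    Assumes one "<doi> <path>" pair per line; adjust the parsing once
    the real format of the mapping file is confirmed.
    """
    paths = []
    for line in mapping_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == doi:
            paths.append(parts[1])
    return paths
```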

Springer:
Is harvested from SFTP.
Harvesting by DOI: Springer doesn't have any mapping. If we don't have the article at all, we will need to harvest all the zips from the Springer SFTP that are not yet in our S3 (see the sketch below). If we already have the article in S3 but, for some reason, it is not in the repo, we can re-process it.
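
A sketch of that fallback, comparing the SFTP listing against what is already in S3; the connection details, credentials, and the compare-by-file-name key layout are all assumptions:

```python
import boto3
import paramiko

def springer_zips_to_harvest(host: str, user: str, password: str,
                             remote_dir: str, bucket: str, prefix: str) -> set:
    """Names of zips present on Springer's SFTP but missing from our S3."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    sftp = client.open_sftp()
    on_sftp = {name for name in sftp.listdir(remote_dir) if name.endswith(".zip")}
    client.close()

    s3 = boto3.client("s3")
    in_s3 = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            in_s3.add(obj["Key"].rsplit("/", 1)[-1])  # compare by file name only
    return on_sftp - in_s3
```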
