
Implement PRA fetcher #352

Draft · wants to merge 2 commits into main

Conversation

@lthurston (Contributor) commented Mar 22, 2023

BLOCKED by:

  • CDL: discovery regarding the varying response types
  • CDL: direction regarding number of requests per invocation

This fetcher is very rough and needs more documentation, but it works on the two collections I've been testing it against. It should be considered a proof of concept at this point.

I've added some comments to the code which you can see in the Files changed tab. There are two main things I wanted to call your attention to:

  1. The first request, to the original_url, is composed of the internal collection id and a hardcoded URL, and returns a ChildrenResponse, which can contain either structural objects or information objects. The former requires additional requests to get to a list of information objects. I've provided specific examples in my code comments, alongside the code that deals with these two scenarios. The main question I have here is whether this is expected, or whether there's an oddball collection in our midst.
  2. This fetcher, unlike some of the others, requires additional requests to get the item metadata from the paginated list of results. Because of this, I needed to add an optional method to the Fetcher class that can be overridden in the children when necessary. I called it get_text_from_response. In the Fetcher class, this method is quite simple; in the PRA fetcher, it's a bit more involved (see the sketch after this list).
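
To make item 2 concrete, here is a minimal sketch of the override pattern. Only the method name get_text_from_response is taken from the actual code; the class layout and placeholder body are illustrative assumptions, not the real implementation:

class Fetcher:
    def get_text_from_response(self, response):
        # Default: the paginated response already contains the item metadata,
        # so the body can be stored as-is.
        return response.text


class PRAFetcher(Fetcher):
    def get_text_from_response(self, response):
        # Override: the paginated response only references items, so make the
        # additional per-item requests, collect each item's metadata, and
        # return a single document with that metadata embedded.
        # (See the actual implementation in the Files changed tab.)
        ...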

I'll add that because so many requests are needed to get the metadata, this fetcher is much slower than the others I've used to date. Two requests take place for each item in the result set (roughly 200 requests for 100 items), so 100 items can take a minute or two.

Please take a look when you get a chance and we can discuss on Monday.

@lthurston lthurston changed the title Pra fetcher PRA fetcher Mar 22, 2023
@lthurston lthurston linked an issue Mar 22, 2023 that may be closed by this pull request
@lthurston lthurston added this to the #4 CIC work milestone Mar 22, 2023
@lthurston lthurston changed the base branch from main to mappers-march-1 March 23, 2023 19:19
Comment on lines 83 to 87
+ text = self.get_text_from_response(response)
  if settings.DATA_DEST == 'local':
-     self.fetchtolocal(response.text)
+     self.fetchtolocal(text)
  else:
-     self.fetchtos3(response.text)
+     self.fetchtos3(text)
@lthurston (Contributor, Author):
Because the metadata for each item is not fetched from the same URL that provides the paginated list of items (the list only references them), two additional HTTP requests are required per item to fetch its metadata. An additional method was added to allow generating a pseudo-response that contains the paginated list with the item metadata embedded. This method, get_text_from_response, returns response.text by default; see below.

Comment on lines 33 to 61
def get_first_page_url(self):
"""
Two possibilities exist:

1) The `original_url` contains a list of IO children and is the first page of results, or
2) The `original_url` is a list of SO children, hopefully contain 1 item, and we must do more
to get to the list of IO children:

Fetching the first page of items requires two requests to get to it. The first, the original_url,
returns a ChildrenResponse, from which the URL from the first Child is extracted. The second request returns
an EntityResponse, from which the URL is extracted from AdditionalInformation/Children's text node.
"""
request = self.build_url_request(self.original_url)
response = requests.get(**request)
root = ElementTree.fromstring(response.text)

# If we have IO (Information Object) children, then this is the first page. Otherwise,
# we have to continue digging.
io_children = root.findall(".//pra:Child[@type='IO']", self.NAMESPACES)
if len(io_children) > 0:
return self.original_url

child_url = root.find(".//pra:Child[@type='SO']", self.NAMESPACES).text

request = self.build_url_request(child_url)
response = requests.get(**request)
root = ElementTree.fromstring(response.text)

return root.find(".//pra:AdditionalInformation/pra:Children", self.NAMESPACES).text
@lthurston (Contributor, Author) commented Mar 23, 2023:

Further discovery is required, but the original_url (the first URL generated from the internal collection id) seems to come in at least two forms, which can be seen by comparing the children URLs for these two collections:

26782: https://us.preservica.com/api/entity/v6.0/structural-objects/079822c0-b1fc-45df-8166-19880b12edba/children

26460: https://us.preservica.com/api/entity/v6.0/structural-objects/fd56c609-a9aa-4e8e-8745-60d1128befe0/children

The first one lists children with the type of IO, information objects. This is a page of results, with the elements referencing the objects and their metadata.

The second lists a single child of the type SO, a structural object, which requires an additional request to get to the paginated list of items.

The fetcher currently works for both these collections, but needs to be tested against others.

Comment on lines 72 to 87
def get_text_from_response(self, response):
# Starting with a list of `information-objects` URLs
object_url_elements = ElementTree.fromstring(response.text).findall("pra:Children/"
"pra:Child", self.NAMESPACES)

object_urls = [element.text for element in object_url_elements]

# Getting an individual `information-object`, extracting the URL
# for the oai_dc metadata fragment
metadata_urls = {object_url: self.get_metadata_url_from_object(object_url)
for object_url in object_urls}

# Getting the metadata
items = {object_url: self.get_metadata_from_url(metadata_url)
for (object_url, metadata_url) in metadata_urls.items()
if metadata_url is not None}
@lthurston (Contributor, Author):

Here I'm gathering the metadata into a single result document, starting from the response to the request for the paginated list of items. A sketch of the two per-item helper requests follows below.
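
For reference, a rough sketch of what the two per-item requests (get_metadata_url_from_object and get_metadata_from_url) might look like. Their bodies aren't shown in this hunk, and the element path used to locate the oai_dc fragment URL is an assumption based on the comments above, not the actual code:

def get_metadata_url_from_object(self, object_url):
    # Request 1: fetch the individual information-object and look for the URL
    # of its oai_dc metadata fragment (the Metadata/Fragment path is assumed).
    request = self.build_url_request(object_url)
    response = requests.get(**request)
    root = ElementTree.fromstring(response.text)
    for fragment in root.findall(".//pra:Metadata/pra:Fragment", self.NAMESPACES):
        if "oai_dc" in (fragment.get("schema") or ""):
            return fragment.text
    return None

def get_metadata_from_url(self, metadata_url):
    # Request 2: fetch the metadata fragment itself and return its text.
    request = self.build_url_request(metadata_url)
    response = requests.get(**request)
    return response.text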

Comment on lines 90 to 96
output_document = response.text

for search, replace in items.items():
output_document = output_document.replace(search, replace)

return output_document
@lthurston (Contributor, Author):

This is not how I intend to construct the fetched page of data. Consider it more of a proof of concept.

Base automatically changed from mappers-march-1 to main March 29, 2023 22:51
@lthurston lthurston changed the title PRA fetcher [WIP] Implement pra fetcher Mar 30, 2023
@lthurston lthurston changed the title [WIP] Implement pra fetcher [BLOCKED] Implement pra fetcher Apr 2, 2023
@lthurston lthurston changed the title [BLOCKED] Implement pra fetcher [BLOCKED] Implement PRA fetcher Apr 2, 2023
@lthurston lthurston changed the title [BLOCKED] Implement PRA fetcher Implement PRA fetcher Apr 3, 2023
@aturner (Collaborator) commented Jul 5, 2023

@lthurston we checked in with Preservica re: their caveat that Basic Authentication (using the API) may not be supported long-term. They have no immediate plans to remove it, but it has evidently been deprecated for multiple releases and they reserve the right to remove it at any point. Their suggestion is below -- can we update the fetcher to authenticate with a token?

==

The recommended way to authenticate against the APIs is via the auth token; there is an API call which will return a token that can be used to make API calls.

Authentication API Documentation (preservica.com)
https://developers.preservica.com/blog/getting-started-with-preservica-access-tokens

The python SDK uses these access tokens and manages the process for you.
https://pypreservica.readthedocs.io/en/latest/intro.html#authentication
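
For reference, a rough sketch of the token flow described above (not the implemented fetcher code). The endpoint, parameter, and header names follow the linked documentation as I understand it and should be verified against it; build_url_request here is only an illustration of swapping the auth mechanism, not the fetcher's actual method:

import requests

def get_access_token(base_url, username, password, tenant):
    # Exchange the existing credentials for a short-lived access token.
    response = requests.post(
        f"{base_url}/api/accesstoken/login",
        params={"username": username, "password": password, "tenant": tenant},
    )
    response.raise_for_status()
    return response.json()["token"]

def build_url_request(url, token):
    # Instead of a Basic Auth tuple, send the token header on each request.
    return {"url": url, "headers": {"Preservica-Access-Token": token}}

Alternatively, the pyPreservica SDK linked above handles token acquisition and refresh itself.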

@lthurston (Contributor, Author):
@aturner I can take a look, sounds like a good plan. Are we still waiting on an answer to the fetcher / mapper question?

@christinklez christinklez linked an issue Nov 29, 2023 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

PreservicaAPIMapper(Mapper) -- paused
Fetcher: PRA -- paused
3 participants