Implement PRA fetcher #352

lthurston · 2023-03-22T23:38:20Z

BLOCKED by:

CDL: discovery regarding the varying response types
CDL: direction regarding number of requests per invocation

This fetcher is very rough and needs more documentation, but it works on the two collections I've been testing it against. Consider it a proof of concept at this point.

I've added some comments to the code which you can see in the Files changed tab. There are two main things I wanted to call your attention to:

The first request, the original_url, is composed of the internal collection id and a hardcoded URL, and returns a ChildrenResponse, which can be composed of structural objects or information objects. The former requires additional requests to get to a list of information objects. I've provided specific examples in my code comments, alongside the code that deals with these two scenarios. The primary question I have about this is whether or not this is expected, or if there's an oddball collection in our midst.
This fetcher, unlike some of the others, requires additional requests to get the item metadata for each entry in the paginated results. Because of this, I needed to add an optional method to the Fetcher class that subclasses can override when necessary. I called it get_text_from_response. In the Fetcher class, you'll see that this method is quite simple. In the PRA fetcher, it's a bit more involved.

I'll add that because so many requests are needed to get the metadata, this fetcher is much slower than the others I've used to date. Two requests take place for each item in the result set, so 100 items can take a minute or two.
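As a rough back-of-the-envelope check on that timing, here is a minimal sketch of the request count per page. The function names and the 0.5 second per-request latency are assumptions for illustration, not measurements from this PR.

```python
def requests_per_page(item_count, requests_per_item=2):
    """One request for the paginated list, plus two per item
    (one for the information object, one for its metadata fragment)."""
    return 1 + item_count * requests_per_item


def estimated_seconds(item_count, seconds_per_request):
    """Sequential estimate; seconds_per_request is an assumed average latency."""
    return requests_per_page(item_count) * seconds_per_request


# 100 items means 201 requests; at an assumed 0.5 s each that is about 100 s.
print(requests_per_page(100))
print(estimated_seconds(100, 0.5))
```

At an assumed average between 0.3 and 0.6 seconds per request, a 100-item page lands in the "minute or two" range described above.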

Please take a look when you get a chance and we can discuss on Monday.

lthurston · 2023-03-23T19:26:52Z

metadata_fetcher/fetchers/Fetcher.py

+            text = self.get_text_from_response(response)
            if settings.DATA_DEST == 'local':
-                self.fetchtolocal(response.text)
+                self.fetchtolocal(text)
            else:
-                self.fetchtos3(response.text)
+                self.fetchtos3(text)


The metadata for each item is not served from the same URL that provides the paginated list of items; the list only references the items, and two further HTTP requests per item are required to fetch the metadata. To handle this, I added a method that generates a pseudo-response containing the paginated list together with the item metadata. By default, this method, get_text_from_response, simply returns response.text (the PRA override is shown below).
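The hook described here can be sketched as follows. This is a simplified stand-in for the real Fetcher class, not its actual code: `fetch_page`, the `sink` list, and `EnrichingFetcher` are hypothetical names used only to show how the default and an override interact.

```python
from types import SimpleNamespace


class Fetcher:
    def get_text_from_response(self, response):
        # Default behaviour: the page body is already the data we want to save.
        return response.text

    def fetch_page(self, response, sink):
        # Stand-in for the fetchtolocal / fetchtos3 branch in the diff above.
        text = self.get_text_from_response(response)
        sink.append(text)


class EnrichingFetcher(Fetcher):
    """Hypothetical subclass that rewrites the page before it is saved."""

    def get_text_from_response(self, response):
        return response.text.replace("ITEM_URL", "<metadata/>")


saved = []
page = SimpleNamespace(text="<page>ITEM_URL</page>")  # fake response object
Fetcher().fetch_page(page, saved)
EnrichingFetcher().fetch_page(page, saved)
print(saved)
```

Fetchers that do not override the hook behave exactly as before, because the default passes response.text straight through.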

lthurston · 2023-03-23T19:39:15Z

metadata_fetcher/fetchers/pra_fetcher.py

+    def get_first_page_url(self):
+        """
+        Two possibilities exist:
+
+        1) The `original_url` contains a list of IO children and is the first page of results, or
+        2) The `original_url` is a list of SO children, hopefully containing 1 item, and we must do more
+           to get to the list of IO children:
+
+           Fetching the first page of items requires two requests to get to it. The first, the original_url,
+           returns a ChildrenResponse, from which the URL from the first Child is extracted. The second request returns
+           an EntityResponse, from which the URL is extracted from AdditionalInformation/Children's text node.
+        """
+        request = self.build_url_request(self.original_url)
+        response = requests.get(**request)
+        root = ElementTree.fromstring(response.text)
+
+        # If we have IO (Information Object) children, then this is the first page. Otherwise,
+        # we have to continue digging.
+        io_children = root.findall(".//pra:Child[@type='IO']", self.NAMESPACES)
+        if len(io_children) > 0:
+            return self.original_url
+
+        child_url = root.find(".//pra:Child[@type='SO']", self.NAMESPACES).text
+
+        request = self.build_url_request(child_url)
+        response = requests.get(**request)
+        root = ElementTree.fromstring(response.text)
+
+        return root.find(".//pra:AdditionalInformation/pra:Children", self.NAMESPACES).text


Further discovery is required, but the original_url (the first URL, generated from the internal collection id) seems to come in at least two forms, which can be seen by looking at the URLs for these two collections:

26782: https://us.preservica.com/api/entity/v6.0/structural-objects/079822c0-b1fc-45df-8166-19880b12edba/children

26460: https://us.preservica.com/api/entity/v6.0/structural-objects/fd56c609-a9aa-4e8e-8745-60d1128befe0/children

The first one lists children with the type of IO, information objects. This is a page of results, with the elements referencing the objects and their metadata.

The second lists a single child of the type SO, a structural object, which requires an additional request to get to the paginated list of items.

The fetcher currently works for both these collections, but needs to be tested against others.
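To make the two response shapes concrete, here is a minimal sketch that classifies a ChildrenResponse. The namespace URI, the sample XML, and the example.org URLs are assumptions for illustration; the real values come from `self.NAMESPACES` and the Preservica responses.

```python
from xml.etree import ElementTree

# Assumed namespace URI; the fetcher keeps the real one in self.NAMESPACES.
NAMESPACES = {"pra": "http://preservica.com/EntityAPI/v6.0"}

IO_PAGE = """<ChildrenResponse xmlns="http://preservica.com/EntityAPI/v6.0">
  <Children>
    <Child type="IO">https://example.org/information-objects/a</Child>
    <Child type="IO">https://example.org/information-objects/b</Child>
  </Children>
</ChildrenResponse>"""

SO_PAGE = """<ChildrenResponse xmlns="http://preservica.com/EntityAPI/v6.0">
  <Children>
    <Child type="SO">https://example.org/structural-objects/c</Child>
  </Children>
</ChildrenResponse>"""


def has_io_children(xml_text):
    """True when the response is already a page of information objects."""
    root = ElementTree.fromstring(xml_text)
    return len(root.findall(".//pra:Child[@type='IO']", NAMESPACES)) > 0


def first_so_child_url(xml_text):
    """URL of the first structural-object child, or None if there is none."""
    root = ElementTree.fromstring(xml_text)
    child = root.find(".//pra:Child[@type='SO']", NAMESPACES)
    return None if child is None else child.text


print(has_io_children(IO_PAGE), first_so_child_url(IO_PAGE))
print(has_io_children(SO_PAGE), first_so_child_url(SO_PAGE))
```

Returning None when no SO child exists is a design choice of this sketch; it would let the fetcher report an unexpected response shape instead of failing on `.text` of a missing element.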

lthurston · 2023-03-23T19:40:28Z

metadata_fetcher/fetchers/pra_fetcher.py

+    def get_text_from_response(self, response):
+        # Starting with a list of `information-objects` URLs
+        object_url_elements = ElementTree.fromstring(response.text).findall("pra:Children/"
+                                                                            "pra:Child", self.NAMESPACES)
+
+        object_urls = [element.text for element in object_url_elements]
+
+        # Getting an individual `information-object`, extracting the URL
+        # for the oai_dc metadata fragment
+        metadata_urls = {object_url: self.get_metadata_url_from_object(object_url)
+                         for object_url in object_urls}
+
+        # Getting the metadata
+        items = {object_url: self.get_metadata_from_url(metadata_url)
+                 for (object_url, metadata_url) in metadata_urls.items()
+                 if metadata_url is not None}


Here I'm gathering the metadata into a single result file, starting from the response from the request for paginated items.

lthurston · 2023-03-23T19:41:21Z

metadata_fetcher/fetchers/pra_fetcher.py

+        output_document = response.text
+
+        for search, replace in items.items():
+            output_document = output_document.replace(search, replace)
+
+        return output_document


This is not how I intend to construct the fetched page of data. Consider it more of a proof of concept.
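One possible alternative to string replacement, sketched under assumptions: parse the page and attach each item's metadata as a child element of its `Child` node. The namespace URI, the sample XML, and the `embed_metadata` name are hypothetical, and this is not necessarily the approach the final fetcher will take.

```python
from xml.etree import ElementTree

# Assumed namespace URI; the fetcher keeps the real one in self.NAMESPACES.
NAMESPACES = {"pra": "http://preservica.com/EntityAPI/v6.0"}


def embed_metadata(page_xml, items):
    """Replace each Child's URL text with its parsed metadata element.

    `items` maps object URL -> metadata XML string, as in the diff above.
    Children without metadata keep their URL text unchanged.
    """
    root = ElementTree.fromstring(page_xml)
    for child in root.findall("pra:Children/pra:Child", NAMESPACES):
        metadata_xml = items.get(child.text)
        if metadata_xml is None:
            continue
        child.text = None
        child.append(ElementTree.fromstring(metadata_xml))
    return ElementTree.tostring(root, encoding="unicode")


page = """<ChildrenResponse xmlns="http://preservica.com/EntityAPI/v6.0">
  <Children>
    <Child type="IO">https://example.org/io/a</Child>
    <Child type="IO">https://example.org/io/b</Child>
  </Children>
</ChildrenResponse>"""

items = {"https://example.org/io/a": "<record><title>Example</title></record>"}
output = embed_metadata(page, items)
print(output)
```

Working on the parsed tree avoids accidental matches when one URL is a substring of another, and it keeps the output well-formed even if the metadata contains text that resembles a URL.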

aturner · 2023-07-05T17:53:54Z

@lthurston we checked in with Preservica re: their caveat that Basic Authentication (using the API) may not be supported long-term. They have no immediate plans to remove it, but it's evidently been deprecated for multiple releases and they reserve the right to remove at any point. Their suggestion, below -- can we update the fetcher to authenticate with a token?

==

The recommended way to authenticate with the APIs is via the auth token: there is an API call that returns a token, which can then be used to make API calls.

Authentication API Documentation (preservica.com)
https://developers.preservica.com/blog/getting-started-with-preservica-access-tokens

The python SDK uses these access tokens and manages the process for you.
https://pypreservica.readthedocs.io/en/latest/intro.html#authentication
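A minimal sketch of what token authentication might look like in the fetcher's request-building style. The `/api/accesstoken/login` path, the `token` response field, and the `Preservica-Access-Token` header are my reading of the documentation linked above and should be verified against it; the function names and credentials are hypothetical.

```python
BASE_URL = "https://us.preservica.com"


def build_login_request(username, password, base_url=BASE_URL):
    """Keyword arguments for requests.post(); endpoint path is an assumption."""
    return {
        "url": f"{base_url}/api/accesstoken/login",
        "data": {"username": username, "password": password},
    }


def build_url_request(url, token):
    """Keyword arguments for requests.get(); header name is an assumption."""
    return {"url": url, "headers": {"Preservica-Access-Token": token}}


# Intended use (network calls, so left as comments):
#   token = requests.post(**build_login_request(user, pw)).json()["token"]
#   response = requests.get(**build_url_request(page_url, token))
login = build_login_request("user@example.org", "secret")
page = build_url_request(f"{BASE_URL}/api/entity/v6.0/structural-objects", "abc123")
print(login["url"])
print(page["headers"])
```

Tokens expire, so a real implementation would also need to refresh or re-request the token on an authentication failure; the Python SDK mentioned above is described as handling that process itself.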

lthurston · 2023-07-05T19:53:44Z

@aturner I can take a look, sounds like a good plan. Are we still waiting on an answer to the fetcher / mapper question?

lthurston changed the title ~~Pra fetcher~~ PRA fetcher Mar 22, 2023

lthurston force-pushed the pra-fetcher branch from 172b5ff to fc38059 Compare March 22, 2023 23:41

lthurston linked an issue Mar 22, 2023 that may be closed by this pull request

Fetcher: PRA -- paused #266

Open

lthurston added this to the #4 CIC work milestone Mar 22, 2023

lthurston changed the base branch from main to mappers-march-1 March 23, 2023 19:19

lthurston commented Mar 23, 2023

lthurston force-pushed the pra-fetcher branch from fc38059 to 3c6d140 Compare March 29, 2023 17:21

lthurston force-pushed the mappers-march-1 branch from 5c13baa to 513d645 Compare March 29, 2023 20:35

Base automatically changed from mappers-march-1 to main March 29, 2023 22:51

lthurston changed the title ~~PRA fetcher~~ [WIP] Implement pra fetcher Mar 30, 2023

lthurston changed the title ~~[WIP] Implement pra fetcher~~ [BLOCKED] Implement pra fetcher Apr 2, 2023

lthurston changed the title ~~[BLOCKED] Implement pra fetcher~~ [BLOCKED] Implement PRA fetcher Apr 2, 2023

lthurston added the blocked/paused label Apr 3, 2023

lthurston changed the title ~~[BLOCKED] Implement PRA fetcher~~ Implement PRA fetcher Apr 3, 2023

christinklez removed the blocked/paused label Jun 5, 2023

lthurston force-pushed the pra-fetcher branch from 3c6d140 to e77248d Compare June 12, 2023 18:40

lthurston force-pushed the pra-fetcher branch from acf7b5b to 5a3867b Compare June 19, 2023 20:05

lthurston added 2 commits June 21, 2023 07:41

Implement PRA fetcher

30d6065

Implement PRA mapper

b5cd348

lthurston force-pushed the pra-fetcher branch from 8d1a648 to b5cd348 Compare June 21, 2023 14:42

christinklez added the blocked/paused label Jun 22, 2023

lthurston self-assigned this Jul 5, 2023

christinklez removed the blocked/paused label Oct 9, 2023

christinklez unassigned lthurston Nov 29, 2023

christinklez linked an issue Nov 29, 2023 that may be closed by this pull request

PreservicaAPIMapper(Mapper) -- paused #267

Open

christinklez modified the milestones: #4 CIC work, Mappers & Fetchers (wrap up post-mvp) Feb 5, 2024

amywieliczka force-pushed the main branch 5 times, most recently from 5664e9b to d833812 Compare February 14, 2024 16:50

amywieliczka force-pushed the main branch 10 times, most recently from b68eff9 to 3bd6b7c Compare March 19, 2024 17:22

amywieliczka force-pushed the main branch from e2b4fa6 to b7d8ce3 Compare October 1, 2024 20:54
