Implement PRA fetcher #352
base: main
metadata_fetcher/fetchers/Fetcher.py
Outdated
+ text = self.get_text_from_response(response)
  if settings.DATA_DEST == 'local':
-     self.fetchtolocal(response.text)
+     self.fetchtolocal(text)
  else:
-     self.fetchtos3(response.text)
+     self.fetchtos3(text)
The metadata for each item is not fetched from the same URL that provides the paginated list of items (the list only references the items), and two further HTTP requests per item are required to fetch that metadata. For that reason, an additional method was added to allow the generation of a pseudo-response that contains the paginated list together with the item metadata. This method, get_text_from_response, will by default return response.text; see below.
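To illustrate the hook described above, here is a minimal sketch. The class names, the `fetch_page_text` wrapper, and `FakeResponse` are hypothetical stand-ins and not the project's real classes; only `get_text_from_response` and its default of returning `response.text` come from this PR.

```python
class Fetcher:
    """Hypothetical stand-in for the base fetcher; only the hook is shown."""

    def get_text_from_response(self, response):
        # Default behaviour: the response body is already the page to store.
        return response.text

    def fetch_page_text(self, response):
        # Hypothetical caller, standing in for the code in the diff above.
        return self.get_text_from_response(response)


class PraLikeFetcher(Fetcher):
    """Hypothetical subclass that builds a pseudo-response instead."""

    def get_text_from_response(self, response):
        # Placeholder transformation; the real fetcher would inline item metadata.
        return response.text.replace("item-url", "<metadata/>")


class FakeResponse:
    """Tiny test double exposing the one attribute the hook reads."""

    def __init__(self, text):
        self.text = text
```

Fetchers that can store the raw page keep the default, while a fetcher such as this one overrides the single method and leaves the storage code untouched.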
def get_first_page_url(self):
    """
    Two possibilities exist:

    1) The `original_url` contains a list of IO children and is the first page of results, or
    2) The `original_url` is a list of SO children, hopefully contain 1 item, and we must do more
    to get to the list of IO children:

    Fetching the first page of items requires two requests to get to it. The first, the original_url,
    returns a ChildrenResponse, from which the URL from the first Child is extracted. The second request returns
    an EntityResponse, from which the URL is extracted from AdditionalInformation/Children's text node.
    """
    request = self.build_url_request(self.original_url)
    response = requests.get(**request)
    root = ElementTree.fromstring(response.text)

    # If we have IO (Information Object) children, then this is the first page. Otherwise,
    # we have to continue digging.
    io_children = root.findall(".//pra:Child[@type='IO']", self.NAMESPACES)
    if len(io_children) > 0:
        return self.original_url

    child_url = root.find(".//pra:Child[@type='SO']", self.NAMESPACES).text

    request = self.build_url_request(child_url)
    response = requests.get(**request)
    root = ElementTree.fromstring(response.text)

    return root.find(".//pra:AdditionalInformation/pra:Children", self.NAMESPACES).text
Further discovery is required, but the original_url, the first URL generated from the internal collection id, seems to come in at least two forms, which can be seen by looking at this URL for collections:
The first one lists children of type IO (information objects). This is a page of results, with the elements referencing the objects and their metadata.
The second lists a single child of type SO (a structural object), which requires an additional request to get to the paginated list of items.
The fetcher currently works for both of these collections, but needs to be tested against others.
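The two forms can be told apart with the same XPath checks the method uses. The sketch below is only an illustration: the namespace URI, the sample documents, and the `classify_children` helper are assumptions made up for the example, and the real namespace lives in the fetcher's `NAMESPACES`.

```python
from xml.etree import ElementTree

# Assumed namespace URI, for illustration only.
NAMESPACES = {"pra": "http://example.org/pra"}

# Invented sample of the first form: a page of information objects.
IO_PAGE = """<ChildrenResponse xmlns="http://example.org/pra">
  <Children>
    <Child type="IO">https://example.org/information-objects/1</Child>
    <Child type="IO">https://example.org/information-objects/2</Child>
  </Children>
</ChildrenResponse>"""

# Invented sample of the second form: a single structural object.
SO_PAGE = """<ChildrenResponse xmlns="http://example.org/pra">
  <Children>
    <Child type="SO">https://example.org/structural-objects/9</Child>
  </Children>
</ChildrenResponse>"""


def classify_children(xml_text):
    """Return ("IO", None), ("SO", child_url) or ("unknown", None)."""
    root = ElementTree.fromstring(xml_text)
    if root.findall(".//pra:Child[@type='IO']", NAMESPACES):
        return "IO", None
    so_child = root.find(".//pra:Child[@type='SO']", NAMESPACES)
    if so_child is None:
        # Neither form matched; a third kind of collection may exist.
        return "unknown", None
    return "SO", so_child.text
```

Returning an explicit "unknown" result instead of dereferencing `.text` on a missing element would make an unexpected third form easier to spot while testing against other collections.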
def get_text_from_response(self, response):
    # Starting with a list of `information-objects` URLs
    object_url_elements = ElementTree.fromstring(response.text).findall("pra:Children/"
                                                                        "pra:Child", self.NAMESPACES)

    object_urls = [element.text for element in object_url_elements]

    # Getting an individual `information-object`, extracting the URL
    # for the oai_dc metadata fragment
    metadata_urls = {object_url: self.get_metadata_url_from_object(object_url)
                     for object_url in object_urls}

    # Getting the metadata
    items = {object_url: self.get_metadata_from_url(metadata_url)
             for (object_url, metadata_url) in metadata_urls.items()
             if metadata_url is not None}
Here I'm gathering the metadata into a single result file, starting from the response from the request for paginated items.
    output_document = response.text

    for search, replace in items.items():
        output_document = output_document.replace(search, replace)

    return output_document
This is not how I intend to construct the fetched page of data. Consider it more of a proof of concept.
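One possible direction, offered only as a sketch and not as the planned implementation: build the page by editing the parsed tree instead of replacing strings. Plain string replacement can also rewrite part of a longer URL that happens to start with a shorter one (for example `.../1` inside `.../10`). The namespace URI, the `ref` attribute, and the `embed_metadata` helper below are assumptions for illustration.

```python
from xml.etree import ElementTree

# Assumed namespace URI, for illustration only.
NAMESPACES = {"pra": "http://example.org/pra"}


def embed_metadata(page_xml, items):
    """Attach each item's metadata to its Child element.

    `items` maps an object URL to that object's metadata as an XML string.
    """
    root = ElementTree.fromstring(page_xml)
    for child in root.findall("pra:Children/pra:Child", NAMESPACES):
        object_url = child.text
        metadata_xml = items.get(object_url)
        if metadata_xml is None:
            # No metadata was fetched for this item; leave it unchanged.
            continue
        # Keep the original URL as an attribute (hypothetical name `ref`).
        child.set("ref", object_url)
        child.text = None
        child.append(ElementTree.fromstring(metadata_xml))
    return ElementTree.tostring(root, encoding="unicode")
```

Matching on whole element text means one item's URL can never be replaced inside another item's URL, and the output is always well-formed XML.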
5c13baa to 513d645
@lthurston we checked in with Preservica re: their caveat that Basic Authentication (using the API) may not be supported long-term. They have no immediate plans to remove it, but it has evidently been deprecated for multiple releases and they reserve the right to remove it at any point. Their suggestion is below. Can we update the fetcher to authenticate with a token?
==
The recommended way to authenticate the APIs is via the auth token, there is a API call which will return a token which can be used to make API calls. Authentication API Documentation (preservica.com) The python SDK uses these access tokens and manages the process for you.
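A rough sketch of what token-based authentication could look like in the fetcher. The login path (`/api/accesstoken/login`), the `token` field in the JSON reply, and the `Preservica-Access-Token` header are assumptions that should be checked against Preservica's Authentication API documentation. Both function names are hypothetical, and the `post` argument is expected to behave like `requests.post`.

```python
def get_access_token(base_url, username, password, post):
    """Exchange credentials for an access token.

    `post` is any callable with the shape of `requests.post`; passing it in
    keeps this sketch free of network access and third-party imports.
    """
    response = post(
        f"{base_url}/api/accesstoken/login",  # assumed endpoint path
        data={"username": username, "password": password},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["token"]  # assumed response field


def build_token_request(url, token):
    """Build keyword arguments for requests.get(**request) using the token."""
    return {
        "url": url,
        # Assumed header name; replaces HTTP Basic Authentication.
        "headers": {"Preservica-Access-Token": token},
    }
```

The second function mirrors how `build_url_request` feeds `requests.get(**request)` in the diff, so the change could stay local to request construction. Tokens typically expire, so a real fetcher would also need to refresh them.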
@aturner I can take a look, sounds like a good plan. Are we still waiting on an answer to the fetcher / mapper question?
5664e9b to d833812
b68eff9 to 3bd6b7c
BLOCKED by:
This fetcher is very rough and requires more documentation, but it works on the two collections I've been testing it against. It should be considered a proof of concept at this point.
I've added some comments to the code, which you can see in the Files changed tab. There are two main things I wanted to call your attention to:
1) original_url is composed of the internal collection id and a hardcoded URL, and returns a ChildrenResponse, which can be composed of structural objects or information objects. The former requires additional requests to get to a list of information objects. I've provided specific examples in my code comments, alongside the code that deals with these two scenarios. The primary question I have about this is whether or not this is expected, or if there's an oddball collection in our midst.
2) get_text_from_response. In the Fetcher class, you'll see that this method is quite simple. In the PRA fetcher, it's a bit more involved.
I'll add that because so many requests are needed to get the metadata, this fetcher is much slower than the others I've used to date. Two requests take place for each item in the result set, so 100 items can take a minute or two.
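If the per-item requests turn out to be the bottleneck, one option is to issue them concurrently. This is a sketch of that idea and not something the PR currently does: `fetch_all` is a hypothetical helper, `fetch_one` stands in for whatever performs the two requests for a single item, and the worker count is an arbitrary assumption that would need to respect any rate limits on the API.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(object_urls, fetch_one, max_workers=8):
    """Call fetch_one(url) for every URL using a small thread pool.

    Returns a dict mapping each URL to its result. `pool.map` yields results
    in input order, so zipping them with the URLs is safe.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch_one, object_urls))
    return dict(zip(object_urls, results))
```

Threads suit this case because the work is dominated by waiting on HTTP responses. An exception raised by any single `fetch_one` call propagates when the results are collected, so error handling per item would still need to be decided.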
Please take a look when you get a chance and we can discuss on Monday.