
BrowseEverything S3 setup #173

Open
4 tasks
Tracked by #85
crisr15 opened this issue Nov 15, 2022 · 18 comments
Comments

@crisr15

crisr15 commented Nov 15, 2022

This is partially blocked by #207

Summary

This is established in the gem.

The 'add cloud files' button in importers is not working. When you click it, it should prompt you with options. There should also be an 'add cloud files' button on the work page, under the 'add files' and 'add folder' buttons.

[Image: screenshot]

Resources for installing the browse-everything gem:

BL would like us to set up one S3 bucket that can be used for staging and production.

Acceptance Criteria

  • There will be one S3 bucket shared across environments (staging and production); that means all tenants use the same bucket.
  • S3 configuration added to the browse-everything initializer.
  • 'Add cloud files' button prompts the user with options when clicked.
  • 'Add cloud files' button added to the work page under the 'add files' and 'add folder' buttons.
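The initializer criterion above can be sketched as a browse_everything.yml fragment. This is a minimal sketch based on the browse-everything gem's documented S3 provider keys; the bucket name and ENV variable names here are placeholders, not this project's values:

```yaml
# Sketch of the s3 provider block for browse_everything.yml.
# Bucket name and ENV keys are placeholders.
s3:
  bucket: example-be-bucket                       # bare bucket name
  app_key: <%= ENV['AWS_ACCESS_KEY_ID'] %>        # IAM access key
  app_secret: <%= ENV['AWS_SECRET_ACCESS_KEY'] %> # IAM secret
  region: eu-west-1
```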

Notes

@jeremyf
Contributor

jeremyf commented Dec 7, 2022

Rory pointed me to https://github.com/research-technologies/browse-everything/tree/sharepoint_provider, a fork that includes the SharePoint work. We'll need to discuss whether this gets pushed back to Samvera or whether we proceed with the fork.

@jeremyf
Contributor

jeremyf commented Dec 7, 2022

On 2022-12-07 I reached out to Jenny via Slack regarding credentials. There are ongoing conversations via email with their IT department concerning BrowseEverything and its implementation.

@j-basford
Collaborator

From BL Tech:

No issues with Rory’s response to my question (where’s the Fedora instance hosted? On the same AWS instance where the repository runs. As far as I understand, this isn’t the BL’s but CoSector’s, i.e. we don’t/can’t log in to AWS to manage/configure the services, they do, but happy to be corrected. If it is our instance (registered to the BL, we pay AWS for it, etc.) then we need to take additional steps to secure it.)

I would suggest now raising this with Jon Fryer. My suggestion would be to use a SharePoint Online site to do this; I just want to make sure Jon is OK with the process/principle (what they want to use, Graph, is an automated Microsoft method and standard, and is how we grant other third parties access into our Microsoft tenant; it would just be the first time we’ve done this for a SharePoint site). If Jon is on board, then we’ll need to discuss how we set up such a site, making sure their access is done securely.

I’ve no issue with the gem they propose, or how you intend to use it, but essentially the data you grab using it is going to leave our control/visibility, albeit into a trusted 3rd party cloud and in bulk rather than manually as you’re doing now, so I don’t foresee any issues on the face of it.

TL;DR - Jenny will raise this with CIMU in Jan 23, sounds like BL will be OK with it.

@cziaarm cziaarm added the SL-RC Service Label: Request for change label Jan 10, 2023
@j-basford
Collaborator

BL is OK with the SharePoint access, but they are upgrading their entire SharePoint estate and are busy with that. No Technology resource until Q2 2023, so I suggest we continue with S3 only and come back to SharePoint at a later date (there may be a separate ticket for SharePoint BE).

@j-basford
Collaborator

To proceed with this using S3 only.

@ShanaLMoore
Contributor

@j-basford where can we get credentials for your S3?

@j-basford
Collaborator

@cziaarm should be able to provide these.

@ShanaLMoore
Contributor

ShanaLMoore commented Feb 28, 2023

cc @jillpe moving this out of the sprint: waiting on client decision.

The client needs to have policy and other prior discussions before implementation is done. We will mark this as blocked until further notice.

S3 requires all of their IT to get on board too, and security questions around ownership of the S3 bucket need to be decided as well.

@ShanaLMoore ShanaLMoore removed the status in britishlibrary Feb 28, 2023
@j-basford
Collaborator

We cannot use a BL S3 bucket. Can we use a CoSector one to confirm this works, and then revisit after development is verified? @jillpe @cziaarm

@cziaarm
Collaborator

cziaarm commented Mar 3, 2023

Have created a bucket (and a user with access keys). May need some guidance on what to add to browse_everything.yml.

@ShanaLMoore
Contributor

ShanaLMoore commented Mar 6, 2023

Have created bucket (and a user with access keys). May need some guidance on what to add to browse_everything.yml

Hi @cziaarm, this example may help. Hopefully it's a plug-and-play type of setup.

EDIT: Oh actually, this already exists in BL. So let's try uncommenting the s3 block with the values you generated.

I also found some docs for configuring it.

@cziaarm cziaarm moved this to In Development in britishlibrary Mar 10, 2023
@cziaarm
Collaborator

cziaarm commented Mar 10, 2023

Hi @ShanaLMoore

I have what may be a useful set of values in place in the BE yaml. I'm getting an odd error that makes me think BE has gone wrong constructing a URL along the way (perhaps because of misconfiguration). The config looks like this:

s3:
  bucket: temp-bl-bucket-for-browse-everything.s3.amazonaws.com
  app_key: [MY_APP_KEY]
  app_secret: [MY_APP_SECRET]
  region: eu-west-1   

I've left out the response_type and expires_in options, as there is no mention of them in the docs.

I've checked these values with a simple GET via Postman and they are all good.

In Hyku, I'm getting the S3 option and a modal that looks like this:

[Image: screenshot]

but when I click on "connect" I end up with an error... I don't think the error itself is directly relevant; I think the URL is at fault:

http://bl.bl.test/concern/articles/&state=s3

is the URL I end up at, and it unsurprisingly causes an error.

On closer inspection I can see the link for the "connect" button is:

<a class="btn btn-primary ev-auth" target="blank" id="provider_auth" href="&state=s3">Connect to S3</a>

So it feels like BE is missing something important here?
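One configuration detail worth checking, offered as an assumption drawn from browse-everything example configs rather than anything confirmed in this thread: the `bucket` key normally takes the bare bucket name, not a full `*.s3.amazonaws.com` hostname, so the value in the config above may itself be a misconfiguration:

```yaml
# Assumed correction: bucket as bare name, no hostname suffix.
s3:
  bucket: temp-bl-bucket-for-browse-everything
  region: eu-west-1
```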

@cziaarm
Collaborator

cziaarm commented Mar 10, 2023

So the link is generated by auth_link, which is not overridden for the s3 provider, hence the incomplete URL... but I suspect that would be a link to authenticate a user for the provider. In this case my bucket has key access and that key/secret is in the config, so there should be no need for an auth step? I'd love for someone to show me round a working S3 example, as they must exist; I'm obviously doing something wrong.

@jillpe

jillpe commented Mar 10, 2023

Hi @cziaarm! Shana is out today, but our team has some documentation they wrote that might be helpful:

Adding Browse Everything to a Hyrax Application

Server Side Storage Support Setup

I'm also trying to find someone on the team who has experience with this and could pair with you

@cziaarm cziaarm moved this from In Development to Client QA in britishlibrary Jul 17, 2023
@cziaarm
Collaborator

cziaarm commented Jul 18, 2023

@NoraRamsey @grahamjevon @j-basford

The S3 provider has been configured and is now available on the staging repository. You will need to configure a desktop client to be able to put things into the bucket. I have used WinSCP (a Windows SCP/FTP client that understands S3). Once you've found your preferred file-transfer desktop client that can use the S3 protocol, I will be able to share the access keys with you, and you'll be able to use the "Add Cloud Files" feature both in the normal upload workflow and in Bulkrax imports.
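As an alternative to a GUI client such as WinSCP, files can be pushed into the bucket with the AWS CLI, assuming it is installed and configured with the shared access keys. The bucket name and paths below are placeholders:

```shell
# Upload a file into the BrowseEverything bucket (placeholder names throughout)
aws s3 cp ./big-image.tif s3://example-be-bucket/uploads/ --region eu-west-1

# Confirm it arrived
aws s3 ls s3://example-be-bucket/uploads/ --region eu-west-1
```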

I'll be on slack

@cziaarm
Collaborator

cziaarm commented Jul 25, 2023

Hi Rory, regarding BE: I uploaded a 2.5GB file to S3 today. The work has appeared in the repo, but the file has yet to load. Is it possible to see whether it is still loading behind the scenes, or does this indicate an issue? I ran the upload a few hours ago and got the familiar Chrome error.
Incidentally, the work was duplicated (two copies of the work have appeared and two copies of the importer are showing). I'm not sure if this was human error (perhaps I double-clicked import) or a technical error. I thought I'd wait until we knew whether the import was still running behind the scenes before testing this again.

@cziaarm cziaarm moved this from Client QA to In Development in britishlibrary Jul 25, 2023
@cziaarm cziaarm moved this from In Development to SoftServ QA in britishlibrary Nov 14, 2023
@grahamjevon
Collaborator

grahamjevon commented Nov 16, 2023

Importer successfully imported work with 594MB file using BE. Everything happened as expected.

Importer with 2GB file resulted in a "504 Gateway Time-Out nginx". This message appeared about 1-2 minutes after starting the importer. When I went to the Importer history, there were two duplicate importers for this, which both said "complete":

https://bl.bl-staging.notch8.cloud/importers/122
https://bl.bl-staging.notch8.cloud/importers/123

This resulted in two works being created:

https://bl.bl-staging.notch8.cloud/concern/articles/3f2955a1-3c36-4784-a913-dc7e2790142c
https://bl.bl-staging.notch8.cloud/concern/articles/f30f3791-970a-4ddb-85ac-8931b7264159

While the filename appears in the items list, the item has no file size and it cannot be downloaded. This suggests that the upload of the file failed. This seems to replicate my experience when testing BE back in July.
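The "504 Gateway Time-Out nginx" is consistent with the proxy giving up on the Rails request while it streams the 2GB file. As a stop-gap sketch only (the directive values are illustrative, and the durable fix is moving the download into a background job rather than the web request), the nginx directives that govern this wait are:

```nginx
# Illustrative values; tune to the deployment.
proxy_read_timeout 600s;   # wait longer for the upstream Rails response
proxy_send_timeout 600s;   # wait longer when sending the request upstream
```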

@cziaarm
Collaborator

cziaarm commented Dec 11, 2023

The key here, I think, is that when we use BE via Bulkrax, the web process does the download. This is different from using BE in the upload context, where the worker asynchronously imports the S3 URL and the file is then attached to the file_set.
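The distinction can be illustrated with a toy Ruby sketch (names and behavior are hypothetical, not the Bulkrax/Hyrax API; the download is simulated with a sleep): the synchronous path holds the web request open for the whole download, which is what the nginx proxy times out on, while the asynchronous path returns immediately and leaves the fetch to a worker.

```ruby
# Toy illustration only; not real Bulkrax/Hyrax code.

# Stand-in for streaming a multi-GB file out of S3.
def download(url)
  sleep 0.2
  "contents of #{url}"
end

# Bulkrax-via-web style: the HTTP request blocks until the download
# finishes, so a slow file trips the proxy's read timeout.
def handle_request_sync(url)
  download(url)
  :ok
end

# Upload-workflow style: respond at once; a background worker (here a
# Thread) fetches the file and would then attach it to the file_set.
def handle_request_async(url, jobs)
  jobs << Thread.new { download(url) }
  :accepted
end

jobs = []
started = Time.now
status = handle_request_async("s3://example-bucket/big.tif", jobs)
elapsed = Time.now - started   # returns long before the download finishes
jobs.each(&:join)              # the worker completes in the background
```

Bulkrax's web-request download corresponds to `handle_request_sync` here; moving it behind a job queue, as the upload context already does, is what avoids the gateway timeout.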

@cziaarm cziaarm moved this from SoftServ QA to Deploy to Staging in britishlibrary Mar 20, 2024
@cziaarm cziaarm moved this from Deploy to Staging to Client QA in britishlibrary Mar 21, 2024
@cziaarm cziaarm moved this from Client QA to Deploy to Staging in britishlibrary Mar 26, 2024
@cziaarm cziaarm moved this from Deploy to Staging to Client QA in britishlibrary Mar 27, 2024
@cziaarm cziaarm moved this from Client QA to Done in britishlibrary Apr 17, 2024
Status: Done
9 participants