S3: public access file/bucket structure #2093

Closed
bemoody opened this issue Sep 27, 2023 · 5 comments

@bemoody
Collaborator

bemoody commented Sep 27, 2023

We want to provide mirrors of our published projects on Amazon S3 (similar to our existing Google Cloud mirrors) to offer faster and more convenient access.

Amazon has extremely low limits on the number of distinct "buckets" that can be created/owned by a single account. Solutions that have been proposed are:

  1. Putting all open-access projects as subdirectories (prefixes) in a single bucket.

  2. Asking AWS support to increase limits for our account.

  3. Using multiple accounts.

Ideally we should settle on an approach before uploading many terabytes of data.

Some questions that need to be addressed:

  • How can we monitor usage? We want to be able to obtain aggregate statistics (such as number of requests, number of GB retrieved, and/or number of distinct visitors, over a given time period) on a per-project or per-project-version basis. Can we do this if each project is placed in its own bucket? Can we do this if all projects are in the same bucket?

  • How can we restrict access? We want to be able to make an entire "subdirectory" world-readable or non-world-readable (for example, if a project is under embargo, is only partially uploaded, or must be removed for legal reasons). We want to be able to "flip a switch" - make a single call to the S3 API - that grants or revokes access for millions of files all at once. That API call must not have any effect whatsoever on unrelated projects. Can we do this if all projects are in the same bucket?

  • What are the practical limitations on numbers of buckets or accounts? What is the initial soft limit on number of buckets per account? Is there also a hard limit? How difficult is it to create multiple accounts, and would this be permitted by Amazon?

@bemoody
Collaborator Author

bemoody commented Sep 28, 2023

It sounds like we could possibly get away with putting all public files into one bucket and assigning a different access point (https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) for each project.

It sounds like we can maybe collect usage statistics on a per-access-point basis, but maybe this is an additional service we'd have to pay for: https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudwatch-monitoring.html

On the other hand, "Access points don't support anonymous access" (https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-restrictions-limitations.html). I wonder if this is really true. It'd be unfortunate if we had to issue a gazillion signed URLs just to be able to track usage.

@bemoody
Collaborator Author

bemoody commented Sep 28, 2023

> It sounds like we can maybe collect usage statistics on a per-access-point basis,

or also on a per-prefix basis, but

https://docs.aws.amazon.com/AmazonS3/latest/userguide/metrics-configurations.html

> You can have a maximum of 1,000 metrics configurations per bucket.

so assuming that we needed one "metrics configuration" per core project, this would imply we couldn't put more than 1000 core projects in one bucket.
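To make the per-prefix idea concrete, here is a minimal sketch of what one metrics configuration per core project might look like. It only builds the request arguments; the bucket name, the project slugs, and the "one top-level prefix per project" layout are all assumptions made for illustration, not something decided in this thread.

```python
from typing import Any

# Limit quoted above from the S3 metrics-configurations documentation.
MAX_METRICS_CONFIGS_PER_BUCKET = 1000


def metrics_config_requests(bucket: str, project_slugs: list[str]) -> list[dict[str, Any]]:
    """Build one metrics-configuration request per project.

    Assumes (hypothetically) that each project lives under the prefix
    "<slug>/" in a single shared bucket.
    """
    if len(project_slugs) > MAX_METRICS_CONFIGS_PER_BUCKET:
        raise ValueError(
            f"{len(project_slugs)} projects exceed the limit of "
            f"{MAX_METRICS_CONFIGS_PER_BUCKET} metrics configurations per bucket"
        )
    requests = []
    for slug in project_slugs:
        requests.append(
            {
                "Bucket": bucket,
                "Id": slug,
                "MetricsConfiguration": {
                    "Id": slug,
                    # Only requests for keys under this prefix are counted.
                    "Filter": {"Prefix": f"{slug}/"},
                },
            }
        )
    return requests


if __name__ == "__main__":
    # Hypothetical bucket and project names.
    for kwargs in metrics_config_requests("example-open-bucket", ["project-a", "project-b"]):
        print(kwargs["Id"], kwargs["MetricsConfiguration"]["Filter"])
```

Each dictionary is shaped so that it could be passed to boto3 as `s3.put_bucket_metrics_configuration(**kwargs)`. The length check is where the 1000-project ceiling would show up in practice.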

@bemoody
Collaborator Author

bemoody commented Sep 29, 2023

> How can we restrict access?

Chrystinne has pointed out that it should be possible to include path-based restrictions in the bucket policy. These restrictions would presumably be limited by the 20 KB size limit on bucket policies, but that might not be a problem since they will likely be uncommon and temporary. We just need to be sure we have a clear way to manage them.
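As a rough sketch of how such path-based restrictions could be managed, the function below generates the whole bucket policy from a list of restricted prefixes, so that revoking or restoring access is a single policy update. It assumes the cap mentioned above is the 20 KB size limit on bucket policies; the bucket name and prefixes are hypothetical. Note that an explicit `Deny` with `Principal: "*"` also applies to authenticated users, so a real policy would probably need a condition exempting maintainers.

```python
import json

# Assumed meaning of the cap discussed above: bucket policies max out at 20 KB.
MAX_POLICY_BYTES = 20 * 1024


def build_bucket_policy(bucket: str, blocked_prefixes: list[str]) -> str:
    """Return a policy allowing public reads except under blocked prefixes."""
    statements = [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }
    ]
    if blocked_prefixes:
        statements.append(
            {
                "Sid": "DenyRestrictedProjects",
                "Effect": "Deny",  # an explicit Deny overrides the Allow above
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": [
                    f"arn:aws:s3:::{bucket}/{prefix.strip('/')}/*"
                    for prefix in sorted(blocked_prefixes)
                ],
            }
        )
    policy = json.dumps({"Version": "2012-10-17", "Statement": statements})
    size = len(policy.encode("utf-8"))
    if size > MAX_POLICY_BYTES:
        raise ValueError(f"policy is {size} bytes, over the {MAX_POLICY_BYTES}-byte limit")
    return policy


if __name__ == "__main__":
    # Hypothetical bucket name and embargoed project path.
    print(build_bucket_policy("example-open-bucket", ["embargoed-project/1.0"]))
```

The resulting string could then be applied with one call such as boto3's `s3.put_bucket_policy(Bucket=bucket, Policy=policy)`. Regenerating the policy from a stored list of restricted prefixes, rather than editing it by hand, is one way to keep the restrictions manageable.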

@bemoody
Collaborator Author

bemoody commented Sep 29, 2023

> What are the practical limitations on numbers of buckets or accounts?

https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html

> By default, you can create up to 100 buckets in each of your AWS accounts. If you need additional buckets, you can increase your account bucket quota to a maximum of 1,000 buckets by submitting a quota increase request.

It does not seem viable in the long term to have one published project per bucket, or even one core project per bucket, unless we were able to use multiple accounts (and even then, that would make life difficult in many other ways).
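For a sense of scale, the arithmetic behind this can be sketched as follows, using the two quota figures quoted above. The project count of 2500 is a made-up number for illustration only.

```python
# Quota figures quoted above from the S3 bucket restrictions documentation.
DEFAULT_BUCKETS_PER_ACCOUNT = 100
MAX_BUCKETS_PER_ACCOUNT = 1000


def accounts_needed(n_projects: int, buckets_per_account: int = MAX_BUCKETS_PER_ACCOUNT) -> int:
    """Number of AWS accounts needed if every project gets its own bucket."""
    if n_projects < 0 or buckets_per_account <= 0:
        raise ValueError("n_projects must be >= 0 and buckets_per_account must be > 0")
    # Integer ceiling division.
    return -(-n_projects // buckets_per_account)


if __name__ == "__main__":
    hypothetical_projects = 2500  # illustrative only
    print(accounts_needed(hypothetical_projects))                               # 3
    print(accounts_needed(hypothetical_projects, DEFAULT_BUCKETS_PER_ACCOUNT))  # 25
```

Even with the raised quota, a one-bucket-per-project layout would tie the number of accounts to the number of projects, which is the long-term problem described above.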

@bemoody
Collaborator Author

bemoody commented Dec 7, 2023

Resolved in pull #2086 - public access files will be hosted in the physionet-open bucket.

@bemoody bemoody closed this as completed Dec 7, 2023