S3: public access file/bucket structure #2093

bemoody · 2023-09-27T17:01:44Z

We want to provide mirrors of our published projects on Amazon S3 (similar to our existing Google Cloud mirrors) to give users faster and more convenient access.

Amazon has extremely low limits on the number of distinct "buckets" that can be created/owned by a single account. Solutions that have been proposed are:

Putting all open-access projects as subdirectories (prefixes) in a single bucket.
Asking AWS support to increase limits for our account.
Using multiple accounts.

Ideally we should settle on an approach before uploading many terabytes of data.

Some questions that need to be addressed:

How can we monitor usage? We want to be able to obtain aggregate statistics (such as number of requests, number of GB retrieved, and/or number of distinct visitors, over a given time period) on a per-project or per-project-version basis. Can we do this if each project is placed in its own bucket? Can we do this if all projects are in the same bucket?
How can we restrict access? We want to be able to make an entire "subdirectory" world-readable or non-world-readable (for example, if a project is under embargo, is only partially uploaded, or must be removed for legal reasons.) We want to be able to "flip a switch" - make a single call to the S3 API - that grants or revokes access for millions of files all at once. And that API call must not have any effect whatsoever on unrelated projects. Can we do this if all projects are in the same bucket?
What are the practical limitations on numbers of buckets or accounts? What is the initial soft limit on number of buckets per account? Is there also a hard limit? How difficult is it to create multiple accounts, and would this be permitted by Amazon?
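To make the usage-monitoring question concrete: if all projects shared one bucket, per-project statistics would have to be derived by grouping requests on the key prefix. The following is only a sketch of that grouping. It assumes request records have already been extracted from some log source, and the record fields and the `project/version/path` key layout are hypothetical, not something decided in this thread.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Usage:
    requests: int = 0
    bytes_sent: int = 0
    visitors: set = field(default_factory=set)


def project_of(key, depth=2):
    """Return the first `depth` path components of an object key.

    depth=1 groups by project, depth=2 by project version
    (assuming a hypothetical "project/version/..." key layout).
    """
    return "/".join(key.split("/")[:depth])


def aggregate(records, depth=2):
    """Group (key, bytes_sent, remote_ip) records by key prefix."""
    stats = defaultdict(Usage)
    for key, bytes_sent, remote_ip in records:
        usage = stats[project_of(key, depth)]
        usage.requests += 1
        usage.bytes_sent += bytes_sent
        usage.visitors.add(remote_ip)
    return dict(stats)


# Hypothetical records: (object key, bytes sent, client address)
records = [
    ("proja/1.0/data/file1.dat", 1000, "192.0.2.1"),
    ("proja/1.0/RECORDS", 50, "192.0.2.2"),
    ("projb/2.1/data/file.dat", 700, "192.0.2.1"),
]
stats = aggregate(records)
```

Whether S3 can produce such records cheaply enough at our scale is exactly the open question; the grouping itself is the easy part.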

bemoody · 2023-09-28T18:12:03Z

It sounds like we could possibly get away with putting all public files into one bucket and assigning a different access point (https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) for each project.

It sounds like we can maybe collect usage statistics on a per-access-point basis, but maybe this is an additional service we'd have to pay for: https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudwatch-monitoring.html

On the other hand, "Access points don't support anonymous access" (https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-restrictions-limitations.html). I wonder if this is really true. It'd be unfortunate if we had to issue a gazillion signed URLs just to be able to track usage.

bemoody · 2023-09-28T20:33:15Z

"It sounds like we can maybe collect usage statistics on a per-access-point basis,"

or also on a per-prefix basis, but according to https://docs.aws.amazon.com/AmazonS3/latest/userguide/metrics-configurations.html:

"You can have a maximum of 1,000 metrics configurations per bucket."

So assuming that we need one "metrics configuration" per core project, this would imply that we couldn't put more than 1,000 core projects in one bucket.
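As a rough illustration of that constraint, here is a sketch that builds one prefix-filtered metrics configuration per core project and refuses to exceed the documented limit. The project slugs and the "one prefix per core project" layout are assumptions for illustration only.

```python
MAX_METRICS_CONFIGS = 1000  # per-bucket limit quoted from the AWS docs


def metrics_configurations(project_slugs):
    """Build one prefix-filtered metrics configuration per core project.

    Assumes (hypothetically) that each core project lives under the
    top-level prefix "<slug>/" in a single shared bucket.
    """
    slugs = sorted(set(project_slugs))
    if len(slugs) > MAX_METRICS_CONFIGS:
        raise ValueError(
            f"{len(slugs)} projects exceed the limit of "
            f"{MAX_METRICS_CONFIGS} metrics configurations per bucket"
        )
    return [{"Id": slug, "Filter": {"Prefix": slug + "/"}} for slug in slugs]


configs = metrics_configurations(["projb", "proja", "proja"])

# Each entry could then be applied with boto3, roughly as:
#   s3.put_bucket_metrics_configuration(
#       Bucket=bucket, Id=cfg["Id"], MetricsConfiguration=cfg)
# (not executed here; it needs AWS credentials and may incur charges)
```

Note that request metrics of this kind are aggregate counts per filter; they would not by themselves give us the number of distinct visitors.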

bemoody · 2023-09-29T18:03:14Z

How can we restrict access?

Chrystinne has pointed out that it should be possible to include path-based restrictions in the bucket policy. These restrictions would presumably be limited by the 20 KB size limit on bucket policies, but that might not be a problem since they will likely be uncommon and temporary. We just need to be sure we have a clear way to manage them.
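A sketch of what such a policy might look like, and of how the size limit could be checked before uploading it. The prefixes and statement IDs are hypothetical, and measuring the policy as the UTF-8 length of the compact JSON is only an approximation of how AWS counts the limit.

```python
import json

MAX_POLICY_BYTES = 20 * 1024  # bucket policy size limit (20 KB)


def build_policy(bucket, blocked_prefixes):
    """Allow public reads on the bucket, except under blocked prefixes."""
    statements = [{
        "Sid": "PublicRead",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }]
    if blocked_prefixes:
        # An explicit Deny overrides the Allow above. Caveat: as written
        # it applies to every principal, including our own IAM users.
        statements.append({
            "Sid": "DenyRestrictedProjects",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": [
                f"arn:aws:s3:::{bucket}/{prefix}/*"
                for prefix in sorted(blocked_prefixes)
            ],
        })
    policy = json.dumps(
        {"Version": "2012-10-17", "Statement": statements},
        separators=(",", ":"),
    )
    if len(policy.encode("utf-8")) > MAX_POLICY_BYTES:
        raise ValueError("bucket policy would exceed the 20 KB limit")
    return policy


# Hypothetical embargoed project version
policy = build_policy("physionet-open", ["proja/1.0"])
```

With this shape, restricting or releasing a project is a single policy update that touches only the listed prefixes, which is the "flip a switch" behaviour asked for above. A real policy would probably need a condition exempting our own upload role from the Deny.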

bemoody · 2023-09-29T18:18:16Z

What are the practical limitations on numbers of buckets or accounts?

https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html

"By default, you can create up to 100 buckets in each of your AWS accounts. If you need additional buckets, you can increase your account bucket quota to a maximum of 1,000 buckets by submitting a quota increase request."

It does not seem viable in the long term to have one published project per bucket, or even one core project per bucket, unless we would be able to use multiple accounts (and even then, that would make life difficult in many other ways.)
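For a sense of scale, a small back-of-the-envelope calculation using the two quotas quoted above. The project count is a made-up figure for illustration, not our actual number.

```python
import math

DEFAULT_BUCKETS_PER_ACCOUNT = 100   # default quota quoted above
MAX_BUCKETS_PER_ACCOUNT = 1000      # maximum after a quota increase


def accounts_needed(n_buckets, buckets_per_account):
    """Number of AWS accounts needed to hold n_buckets buckets."""
    return math.ceil(n_buckets / buckets_per_account)


n_projects = 2500  # hypothetical: one bucket per published project
at_default = accounts_needed(n_projects, DEFAULT_BUCKETS_PER_ACCOUNT)
at_maximum = accounts_needed(n_projects, MAX_BUCKETS_PER_ACCOUNT)
print(at_default, at_maximum)  # prints "25 3"
```

Even at the maximum quota, one bucket per project would require several accounts, and the number would keep growing as projects are published.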

bemoody · 2023-12-07T20:54:10Z

Resolved in pull #2086 - public access files will be hosted in the physionet-open bucket.

This was referenced Sep 28, 2023:

PhysioNet-AWS Integration - Managing AWS S3 buckets and objects through PhysioNet #2086 (merged)
S3: usage monitoring #2098 (open)

bemoody closed this as completed Dec 7, 2023
