S3: public access file/bucket structure #2093
It sounds like we could get away with putting all public files into one bucket and assigning a different access point (https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html) to each project. We may be able to collect usage statistics on a per-access-point basis, though that might be an additional service we'd have to pay for: https://docs.aws.amazon.com/AmazonS3/latest/userguide/cloudwatch-monitoring.html

On the other hand, "Access points don't support anonymous access" (https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-restrictions-limitations.html). I wonder if this is really true. It would be unfortunate if we had to issue a gazillion signed URLs just to be able to track usage.
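A minimal sketch of what the per-project access point approach could look like with boto3 (the account ID, bucket name, and naming scheme below are all hypothetical); note that the anonymous-access restriction quoted above would still apply to requests made through these access points:

```python
# Sketch: one access point per project over a single shared bucket.
# Access points are managed through the S3 Control API, not the S3 API.
import boto3

s3control = boto3.client("s3control")

ACCOUNT_ID = "123456789012"   # hypothetical AWS account ID
BUCKET = "physionet-open"     # hypothetical shared public bucket

def create_project_access_point(project_slug: str) -> str:
    """Create a dedicated access point for one project in the shared bucket."""
    resp = s3control.create_access_point(
        AccountId=ACCOUNT_ID,
        Name=f"ap-{project_slug}",  # names must be unique per account and Region
        Bucket=BUCKET,
    )
    return resp["AccessPointArn"]
```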
Usage statistics can also be collected on a per-prefix basis using metrics configurations (https://docs.aws.amazon.com/AmazonS3/latest/userguide/metrics-configurations.html), but S3 allows at most 1,000 metrics configurations per bucket. So, assuming we needed one metrics configuration per core project, we couldn't put more than 1,000 core projects in one bucket.
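For illustration, one such per-project metrics configuration could be created like this (a sketch with boto3; the bucket layout and naming scheme are assumptions):

```python
import boto3

s3 = boto3.client("s3")

def enable_project_metrics(bucket: str, project_slug: str) -> None:
    """Attach a CloudWatch request-metrics filter to one project's prefix.

    S3 allows at most 1,000 of these configurations per bucket.
    """
    s3.put_bucket_metrics_configuration(
        Bucket=bucket,
        Id=f"metrics-{project_slug}",  # hypothetical naming scheme
        MetricsConfiguration={
            "Id": f"metrics-{project_slug}",
            "Filter": {"Prefix": f"{project_slug}/"},
        },
    )
```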
Chrystinne has pointed out that it should be possible to include path-based restrictions in the bucket policy. These restrictions would presumably be limited by the 20 KB bucket policy size cap (https://docs.aws.amazon.com/AmazonS3/latest/userguide/BucketRestrictions.html), but that might not be a problem, since they will likely be uncommon and temporary. We just need a clear way to manage them.
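For illustration, a policy of the shape described above might look like the following (bucket name, Sids, and prefix are hypothetical): a bucket-wide public-read Allow with a path-scoped Deny carved out for one project. An explicit Deny always overrides an Allow in policy evaluation, and each statement counts toward the 20 KB size cap:

```python
# Hypothetical bucket policy shape: everything publicly readable except
# one embargoed project's prefix.
BUCKET = "physionet-open"  # hypothetical bucket name

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "public-read",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            # Explicit Deny overrides the Allow above for this prefix only.
            "Sid": "deny-embargoed-project",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/embargoed-project/*",
        },
    ],
}
```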
It does not seem viable in the long term to have one published project per bucket, or even one core project per bucket, unless we were able to use multiple accounts (and even then, that would make life difficult in many other ways).
Resolved in pull #2086 - public access files will be hosted in the […] bucket.
We want to host mirrors of our published projects on Amazon S3 (similar to our existing Google Cloud mirrors) to give users faster and more convenient access.
Amazon has extremely low limits on the number of distinct "buckets" that can be created/owned by a single account. Solutions that have been proposed are:

- Putting all open-access projects as subdirectories (prefixes) in a single bucket.
- Asking AWS support to increase the limits for our account.
- Using multiple accounts.
Ideally we should settle on an approach before uploading many terabytes of data.
Some questions that need to be addressed:
How can we monitor usage? We want to be able to obtain aggregate statistics (such as number of requests, number of GB retrieved, and/or number of distinct visitors over a given time period) on a per-project or per-project-version basis. Can we do this if each project is placed in its own bucket? Can we do this if all projects are in the same bucket?
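If all projects share one bucket and each has a per-prefix metrics configuration (as discussed in the comments above), per-project statistics could plausibly be pulled from CloudWatch as sketched below. This assumes a metrics configuration whose filter ID is known, and note that S3 request metrics are a paid CloudWatch feature:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def project_request_count(bucket: str, filter_id: str) -> float:
    """Sum of all requests for one metrics-configuration filter, past 24 h."""
    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName="AllRequests",      # BytesDownloaded works the same way
        Dimensions=[
            {"Name": "BucketName", "Value": bucket},
            {"Name": "FilterId", "Value": filter_id},
        ],
        StartTime=end - timedelta(days=1),
        EndTime=end,
        Period=86400,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])
```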
How can we restrict access? We want to be able to make an entire "subdirectory" world-readable or non-world-readable (for example, if a project is under embargo, is only partially uploaded, or must be removed for legal reasons). We want to be able to "flip a switch" - make a single call to the S3 API - that grants or revokes access for millions of files all at once. And that API call must not have any effect whatsoever on unrelated projects. Can we do this if all projects are in the same bucket?
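In the single-bucket scheme, one plausible "switch" is adding or removing a prefix-scoped Deny statement in the bucket policy, as in the sketch below (naming is hypothetical, and it assumes the policy shape sketched earlier in the thread):

```python
import json

import boto3

s3 = boto3.client("s3")

def set_project_public(bucket: str, project_slug: str, public: bool) -> None:
    """Grant or revoke anonymous reads for one project prefix.

    A single PutBucketPolicy call flips access for every object under the
    prefix; no per-object ACLs are touched, and other projects' statements
    are left alone.
    """
    sid = f"deny-{project_slug}"  # hypothetical Sid naming scheme
    policy = json.loads(s3.get_bucket_policy(Bucket=bucket)["Policy"])
    statements = [s for s in policy["Statement"] if s.get("Sid") != sid]
    if not public:
        statements.append({
            "Sid": sid,
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/{project_slug}/*",
        })
    policy["Statement"] = statements
    s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```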
What are the practical limitations on numbers of buckets or accounts? What is the initial soft limit on number of buckets per account? Is there also a hard limit? How difficult is it to create multiple accounts, and would this be permitted by Amazon?