
PhysioNet-AWS Integration - Managing AWS S3 buckets and objects through PhysioNet #2086

Merged
36 commits merged into dev on Nov 29, 2023

Conversation

@Chrystinne
Contributor

This PR introduces a set of functions for managing AWS S3 buckets and objects through PhysioNet.

These functions include creating S3 client objects, determining bucket names, checking bucket existence, uploading files to buckets, configuring bucket policies based on the project's access policy, allowing the download of files using AWS command line tools, and more.

Each function is thoroughly documented with explanations of its purpose and any relevant notes. These additions enhance the project's ability to interact with AWS S3 for storage and access control.
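
For example, the client-creation and existence-check helpers might look roughly like this (a sketch only: the function names and the profile-based session are illustrative, not necessarily what this PR implements):

```python
import boto3
from botocore.exceptions import ClientError

def create_s3_client(profile_name):
    """Build an S3 client from a named AWS profile (e.g. AWS_PROFILE)."""
    session = boto3.Session(profile_name=profile_name)
    return session.client("s3")

def bucket_exists(s3_client, bucket_name):
    """Return True if the bucket exists and the caller can access it."""
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        return True
    except ClientError:
        return False
```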

@tompollard
Member

Thanks @Chrystinne! Please hold off updating the requirements.txt for now because I'm about to bump a few packages, which will cause conflicts anyway.

@Chrystinne
Contributor Author

@tompollard ok. Since you're updating it, could you also include 'boto3==1.28.33'?

tompollard added a commit that referenced this pull request Sep 22, 2023
tompollard added a commit that referenced this pull request Sep 23, 2023
Add boto3 1.28.53 and dependencies. Ref #2086
@tompollard
Member

@Chrystinne I added the boto dependency in: #2089

Please could you run the following steps to update this pull request to include the latest changes on dev?

  1. Pull down the latest version of the dev branch (which now has boto in the requirements):
    git switch dev && git pull
  2. Switch to your feature branch:
    git switch physionet_aws_integration
  3. Remove your changes to requirements.txt by hard resetting to the penultimate commit:
    git reset 0ea7a1f3e4031766836ee0731ebab90c78792b2b --hard
  4. Rebase on the dev branch:
    git rebase dev
  5. Push, overwriting the previous changes:
    git push --force

Once this pull request is rebased, I assume #2087 can be closed.

@Chrystinne
Contributor Author

@tompollard Thank you, Tom! I will follow those steps!

@Chrystinne force-pushed the physionet_aws_integration branch from 3ad7340 to 414c48f on September 25, 2023
@tompollard closed this on September 25, 2023
@tompollard deleted the physionet_aws_integration branch on September 25, 2023
@tompollard restored the physionet_aws_integration branch on September 25, 2023
@tompollard reopened this on September 25, 2023
@tompollard
Member

tompollard commented Sep 25, 2023

@Chrystinne Moving our Slack discussion to this pull request, so we can gather ideas from everyone:

> How do we manage secret keys on our servers? The tests on my PR are now failing because they can't find the AWS credentials. The AWS_PROFILE (that contains AWS credentials configured on my machine) is specified locally in my .env file. Do we need to add the files to the server? Or does our GitHub repo use some sort of program to receive secret keys/credentials?

The tests run independently of our servers, so adding configuration information to them won’t help. GitHub does support secret variables (basically you add a variable for the key to the tests, and then you specify the private value of the variable in GitHub), but I don’t think this is a good solution because I’m pretty sure certain people among us (@bemoody 👀) won’t trust GitHub with private information!

I haven't looked over this pull request yet (or looked at the failing tests), but I think the solution is to find an approach that allows the code to run without the need to share keys. Some ideas to consider:

  1. Only call your new functions if the AWS configuration is switched on (e.g. if AWS_PROFILE is unset, do not call the AWS functions; see the sketch below). This is probably a requirement anyway because Health Data Nexus (and other platforms) may choose not to integrate with AWS.
  2. Add unit tests to the testing framework that check the behaviour of your new functions. You can use the mocker package (or whatever it is called) to simulate the API calls. We use this approach elsewhere for testing, so you should be able to find examples.
  3. Add an AWS configuration step to the GitHub workflow that sets up the AWS environment on the testing server, if possible. The nice thing about this would be that you could actually test the AWS integration, rather than just pretending to test it (like in 2).

1 will probably fix this pull request. 2 is important, so should be added shortly after we merge this pull request (or someone might argue before!). 3 would be great, but I'm unclear whether it is possible to do this without costing us money.
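
For idea 1, a minimal sketch (the helper names here are illustrative, not from this PR):

```python
import os

def aws_integration_enabled():
    """True only when an AWS profile is configured for this deployment."""
    return bool(os.environ.get("AWS_PROFILE"))

def mirror_project_if_enabled(project):
    # Skip AWS entirely when no profile is set, so that Health Data Nexus
    # (and the CI tests, which have no credentials) never touch boto3.
    if not aws_integration_enabled():
        return
    upload_project_to_s3(project)  # hypothetical helper from this PR
```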

@bemoody
Collaborator

bemoody commented Sep 25, 2023

Code style

Please don't add more functions to console/utility.py.

Instead, create a new file called (for example) project/cloud/s3.py, and put your new functions there.

Storing bucket metadata

For GCP, we have the class GCP in project/modelcomponents/storage.py. We'll need to define a similar class to store the corresponding information for S3 project mirrors.

(In particular, name of the bucket and the date/time it was created/updated.)

However, I think it would be better to define a single CloudMirror class that could be used for both GCP and S3, and eventually get rid of the GCP class.
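
For example, a single model along these lines (a rough sketch: field names are placeholders, and the ForeignKey target assumes the existing PublishedProject model):

```python
from django.db import models

class CloudMirror(models.Model):
    """One row per project mirror on a cloud platform (GCP or S3)."""
    PLATFORM_CHOICES = [("gcp", "GCP"), ("s3", "AWS S3")]

    project = models.ForeignKey("project.PublishedProject",
                                related_name="cloud_mirrors",
                                on_delete=models.CASCADE)
    platform = models.CharField(max_length=8, choices=PLATFORM_CHOICES)
    bucket_name = models.CharField(max_length=255)
    created_datetime = models.DateTimeField(auto_now_add=True)
    updated_datetime = models.DateTimeField(auto_now=True)

    class Meta:
        unique_together = ("project", "platform")
```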

@bemoody
Collaborator

bemoody commented Sep 25, 2023

For reference, how boto3 obtains credentials:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html

@bemoody
Collaborator

bemoody commented Sep 25, 2023

https://pypi.org/project/moto/ sounds like it might be easy to use and sufficient for unit testing.

Otherwise, there appear to be quite a number of "S3-compatible" servers that could be used for testing, some of which are free software.
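
For instance, a minimal moto-backed test might look like this (moto 4.x decorator API; this sketch exercises S3 directly rather than the helpers in this PR):

```python
import boto3
from moto import mock_s3  # moto 4.x; newer releases use mock_aws

@mock_s3
def test_upload_creates_bucket_and_objects():
    # Every boto3 call below hits moto's in-memory backend, not AWS.
    client = boto3.client("s3", region_name="us-east-1")
    client.create_bucket(Bucket="example-project-bucket")
    client.put_object(Bucket="example-project-bucket",
                      Key="files/RECORDS", Body=b"record list\n")
    listing = client.list_objects_v2(Bucket="example-project-bucket")
    assert [obj["Key"] for obj in listing["Contents"]] == ["files/RECORDS"]
```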

@tompollard
Member

@tompollard left a comment

@Chrystinne I took a quick look and added some suggestions. Mostly small things (e.g. imports go at the top!).

physionet-django/console/utility.py — inline review comments (outdated, resolved)
@Chrystinne
Contributor Author

@bemoody I'm adding the functions to a new file (project/cloud/s3.py)

@Chrystinne force-pushed the physionet_aws_integration branch 3 times, most recently from fd1a145 to 4b35757 on September 28, 2023
… raising an exception if AWS_PROFILE is undefined.
…me it was created). It also allows checking whether the data is available to be downloaded from S3 using this metadata.
…ogging, directing logs to the specified target bucket and prefix. Remove unnecessary methods. Raise an exception if the S3 credentials are undefined.
…there is no AWS bucket. Check the bucket owner before uploading a project to S3.
@Chrystinne force-pushed the physionet_aws_integration branch from fcf5f20 to 42542f3 on November 21, 2023
@bemoody merged commit 8e8ff42 into dev Nov 29, 2023
11 checks passed
@tompollard deleted the physionet_aws_integration branch November 29, 2023
This was referenced Dec 1, 2023
tompollard added a commit that referenced this pull request Dec 7, 2023
This adds a couple of test cases for the functions used to manage S3
buckets (added in pull #2086).

It uses moto as a mock server, and tests that we can "upload" a couple
of example projects and that the buckets and objects are created as we
expect.

A limitation is that this doesn't test permissions (i.e. to check that
the policy lets people download and list files.) moto has some ability
to emulate AWS IAM policies, but I'm not sure if this extends to S3
bucket policies.

I think it should also be possible to adapt this to run tests against
the *live* system. I'm not going to try that at the moment.
tompollard added a commit that referenced this pull request Jul 9, 2024
## Context

Pull #2086 added the ability to mirror project content on Amazon S3.
This is now working and we are in the process of uploading open-access
projects from PhysioNet.

The changes here will be needed once we start uploading
*restricted/credentialed* projects, so that we can securely grant access
to authorized users. (Identity verification aside, there are also some
more significant changes that are needed for handling
restricted/credentialed projects; see issue #2094.)

In brief: currently (in the old system Felipe set up), people are asked
to *self-report* their AWS account number, and *any person or service
within that account* would be allowed to access restricted data.

With these changes, in contrast, people will be asked to *verify* their
personal AWS identity; subsequently, we'll be able to grant access *only
to verified identities* (the latter part is yet to be implemented.)

## Why

DUAs for MIMIC and other databases require that data is only shared with
authorized *individuals* (each person must register on PhysioNet and be
credentialed.) We want to enable cloud access for better performance,
but complying with these DUAs requires knowing who is being granted
permission to use these cloud services.

Moreover, although each user is ultimately responsible for data
security, we want to encourage good practices. People may be using AWS
for all sorts of reasons unrelated to PhysioNet. Giving themselves
permission to access MIMIC through their personal account should not
*also* grant permission to all of those unrelated and
possibly-less-trusted services.

Some people may be using organizational AWS accounts rather than
personal ones. Maybe we want to discourage this, or maybe not, but we
can't prevent it. One member of an organization having access shouldn't
grant access to everyone in the organization.

There is a lot about AWS authentication that is still a bit mysterious
to me, but my gut feeling is that the "IAM user" level is the right
level of authentication for PhysioNet and MIMIC.

It has been suggested that we could ask people to self-report their AWS
username (or ARN?) in addition to their account number. And yes, that
would be an improvement; but it has the disadvantage that usernames are
variable-length, and may not be long-term stable. Better would be to ask
people to self-report their *AWS userid*, but that's not easy for people
to find and more likely to cause mistakes.

Finally, I can imagine that in the future there may be other reasons for
wanting to associate a PhysioNet account with an AWS account, and having
a strong verification process could enable more interesting forms of
integration.

## How identity verification works

The concept is that we would have a special-purpose S3 bucket which
allows access *only if the path matches the requester's AWS account and
userid.* To prove your identity, you generate a signature for a URL that
can only be accessed by you, and paste that signed URL into a form on
the site.

The process would be:

1. You go to your cloud settings page on PhysioNet.

2. We tell you to run the command `aws sts get-caller-identity`.

3. You copy the output into the form.

4. We then tell you to run a command like `aws s3 presign
s3://asdfghjk/physionet.org-verification/[email protected]/userid=AIDAABCDEFGHIJKL/account=112233445566/username=barackobama/`.

5. You copy the output into the form.

6. We verify the format of the URL and submit it to AWS to verify the
signature.

## Wait a minute, what's this "userid" thing you keep talking about?


https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_identifiers.html#identifiers-unique-ids

## Setup and testing

Using this feature requires creating a special-purpose S3 bucket (a
bucket which probably should not be used for anything else.)

*For the time being*, you can test this by setting
`AWS_VERIFICATION_BUCKET_NAME` to `bm-uverify-test1`. I will delete that
bucket once we've set up a permanent replacement under the PhysioNet AWS
account.

If you want to see exactly how the verification bucket is created, and
test it yourself, see the instructions in `deploy/README.md`.

## Background

Although this implementation is guided by the needs of PhysioNet, my
goal has been to design a general-purpose authentication protocol that
could be used by any site that needs to verify cross-account AWS
identities.

This is inspired in part by the technique used by Hashicorp Vault and
discussed here:
* https://ahermosilla.com/cloud/2020/11/17/leveraging-aws-signed-requests.html
* https://www.hashicorp.com/resources/deep-dive-vault-aws-auth-backend

and similarly: https://stackoverflow.com/a/76099155

We could use the same method, but it would require the person to
download and run a small program (and that program involves some pretty
hairy digging into the AWS API.)

The method proposed here, in contrast, only requires the person to
install the official AWS CLI and run a couple of commands. I think that
this is easier to understand and therefore paradoxically more secure
(see if you can spot the security flaw in the StackOverflow answer.)

For information about *why* this works, see AWS documentation on policy
variables:

https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html
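
As a rough illustration (this is not the actual policy from deploy/README.md, and the ${aws:PrincipalAccount} segment in particular is an assumption of this sketch), the verification bucket can embed these variables in the resource path so each caller can only read, and therefore only usefully presign, their own path:

```python
import json
import boto3

# ${aws:userid} and ${aws:username} are documented IAM policy variables;
# whether ${aws:PrincipalAccount} also works here is an assumption.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": ("arn:aws:s3:::bm-uverify-test1"
                     "/physionet.org-verification/*"
                     "/userid=${aws:userid}"
                     "/account=${aws:PrincipalAccount}"
                     "/username=${aws:username}/"),
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="bm-uverify-test1", Policy=json.dumps(policy))
```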

Also see the AWS CLI documentation:

https://docs.aws.amazon.com/cli/latest/reference/sts/get-caller-identity.html
https://docs.aws.amazon.com/cli/latest/reference/s3/presign.html