[RFC] File Storage #9
Replies: 5 comments 9 replies
-
Hey! I know Lift's objective is to provide out-of-the-box solutions, but for me it must also be configurable to be usable. Do you plan to provide a parameter to change each of the options (encryption, versioning, tiering)?

**Versioning**

I would definitely like to change versioning settings. Your defaults seem nice, but could you add one parameter to change how long old versions are kept, or to disable removal entirely?

```yaml
storage:
  thumbnails:
    versioning: 30 / true / false
```

I know I could do it all myself with raw CloudFormation, but if I start using Lift, I would like to use it for all buckets, even if one of them must have versioning disabled.

**Tiering**

Moving to Glacier by default is a big no-no for me. Using Glacier is a special case, not a regular one. I would rather see it as an optional feature, enabled with a parameter rather than disabled with one. Many developers who use Lift without reading the details will be surprised that their applications cannot simply retrieve an object from S3. Apart from that, using Intelligent-Tiering by default seems good, I think.

**Auto-removal**

Whenever I have S3 buckets, I need to add the @purple/serverless-s3-remover sls plugin to automatically clean the bucket on stack removal. Of course, losing all the files in an S3 bucket when you, let's say, accidentally remove the stack is probably something that not everyone expects. I could argue that DynamoDB does not have the same protection by default: a table will be removed even if it still has items in it. But it would be nice if Lift took care not only of creating buckets but also of cleaning them so they can be removed with the stack. Maybe via an additional parameter?

```yaml
storage:
  thumbnails:
    clean: true # false is default
```

Having cleaning enabled would basically do the same as the plugin linked above: a pre-remove hook would delete everything from the bucket. I don't know if you will like this idea, but if so, I can help and migrate this behavior to Lift.

**Naming**

Btw, I assume the CloudFormation resource naming will be consistent and based on the provided name.

I really like your idea of making serverless easier, bringing what it already does for Lambda functions to other parts of the architecture. I hope my feedback helps, even if you disagree with it 😉 Let me know if I can help; I have some experience with serverless plugins and can help with smaller parts (as I have other things I work on in my free time, probably like we all do).
-
I guess it might need a `name` parameter? Given that bucket names are globally unique, you have to pick a unique name, but you'll want to refer to it locally with a simple use-based key. I'd actually suggest making `name` a required field for that reason.
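As an illustration of that suggestion (the `name` key and the values here are hypothetical, not a confirmed Lift option):

```yaml
storage:
  avatars:                     # short local key used within serverless.yml
    name: my-app-avatars-prod  # globally unique S3 bucket name
```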
-
This construct looks like a good solution to quickly deploy an "upload" bucket for users, for example to gather user-generated content. Maybe a parameter to auto-generate an endpoint that returns a signed "put" URL would be nice? Possibly with a Lambda authorizer? In the declared lifecycle policy, should "delete incomplete multipart uploads" be enabled by default? Or maybe enabled with a parameter (for the case of a bucket used to upload large videos)?
-
I like this construct. It lets you quickly create a fully configurable bucket, and that's all I need in most scenarios ❤️. Here is an example I've created: https://github.com/Pigius/s3-event-notifications-lambda-dynamodb.
-
Hi @fredericbarthelet! This construct is SUPER useful! Two things may make it better IMO:
-
The goal of this discussion is to get feedback on the "File Storage" component.

If you are new to Lift, a quick intro: it's a Serverless plugin that can be installed via npm and enabled in any `serverless.yml` file. Here is what we are planning so far.

## Use case

Providing a file storage solution with S3.

## Quick start

This example will create an S3 bucket.
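The snippet itself is not shown above; as a minimal sketch, assuming the `storage` construct key and configuration shape discussed in this thread:

```yaml
# serverless.yml
service: my-app

provider:
  name: aws

plugins:
  - serverless-lift

storage:
  thumbnails: {}
```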
## What's included

### Encryption

There are 4 different solutions to encrypt data at rest in S3:

- `SSE-S3` - S3 manages the key used to encrypt the data at rest
- `SSE-KMS` - KMS is used to generate a master key that will be used to encrypt the data at rest
- `SSE-C` - you provide your own encryption key
- client-side encryption - you encrypt the data yourself before uploading it

`SSE-S3` is the preferred option and is activated by default, as it provides data encryption at rest that is fully managed by AWS, at no additional cost and without requiring any additional development.

The encryption method can be changed to `SSE-KMS` using the `encryption: kms` configuration, or set to a specific KMS key using the `encryption: <myKmsKeyId>` configuration.

### Versioning

A versioning-enabled bucket ensures that no accidental deletion or overwrite operation results in data loss:

- Deleting a file adds a delete marker.
- Overwriting a file adds a new version containing the new content, without deleting the previously existing file content.

Versioning comes at a cost: each version of a file is billed at normal S3 rates. In order to provide versioning without a lasting cost impact, a lifecycle configuration with `NoncurrentVersionExpiration` set to 30 days is added. It automatically cleans up non-current versions of a file (whether a delete marker or an old version) 30 days after they become non-current, giving you the opportunity to recover a file for a month before permanent deletion.

### Intelligent Tiering
By default, Lift provisions the S3 bucket with an Intelligent-Tiering configuration: files are moved to the IA storage class 30 days after their last access.

Glacier is a special storage class: you must interact with the Initiate Job API to restore a file before actually being able to retrieve it. We are not sure it is a good idea to enable this storage class by default, WDYT?

Moving files to Glacier can be enabled using the `archive: 180` configuration (180 being the number of days without access after which files are moved to Glacier). We are not sure yet whether the IA duration should be configurable as well.
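Putting the options above together, a bucket configuration could look like this (a sketch; the `storage` key and the value formats follow the examples given in this RFC):

```yaml
storage:
  thumbnails:
    encryption: kms   # or a KMS key id to use a specific key
    archive: 180      # move files to Glacier after 180 days without access
```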
### Security

All Lambdas defined within your serverless.yml file will be allowed the following actions by default:

This behavior can be disabled with the following configuration:

The `deploy` command will trigger a post-deploy hook that will analyse S3 Storage Lens data and give recommendations for changing the IA and Glacier durations.

Feedback is welcome, just post a reply to this discussion 🙏