Change default EKS storageclass to use EFS, not GP2 #918

Open

gaktive opened this issue Aug 4, 2020 · 3 comments

gaktive commented Aug 4, 2020

Based on some recent observations of EKS deployments with CAP 2, we need to encourage admins to deploy with EFS as the backing storage. We've mentioned GP2 in the documentation before, but there is a limit to how many GP2-backed PVCs can be attached to a single node, and GP2 volumes cannot span multiple availability zones the way EFS does. We may, however, need to be mindful that EFS has its own limits on write operations per second.

@troytop can provide more details.
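
For reference, a minimal sketch of what an EFS-backed StorageClass could look like, assuming the aws-efs-csi-driver (with dynamic provisioning support) is installed on the cluster; the class name and file system ID are placeholders, not values from this issue:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc                    # placeholder name
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-12345678       # placeholder EFS file system ID
  directoryPerms: "700"
EOF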

jandubois (Member) commented:

Isn't there some issue with diego-cell and EFS? Garden tries to allocate all available disk space, minus some reserved amount (something like 15GB), to grootfs, and then runs into a problem allocating a sparse file of that size.

I have some notes from Andrew/Vlad saying you can work around this with something like the following:

properties:
  diego-cell:
    rep:
      diego:
        executor:
          disk_capacity_mb: 40960  # cap the cell's advertised disk capacity at 40 GiB
    garden:
      grootfs:
        # reserve (nearly) everything else so grootfs doesn't size its store
        # from the petabyte-scale capacity reported by EFS
        reserved_space_for_other_jobs_in_mb: 8796051063807

But that feels more like a hack than a solution, so I don't think this should go into the docs as the recommended configuration.


troytop commented Aug 5, 2020

IIUC, this workaround is currently required when using EFS, but using GP2 will not scale. We need details from @colstrom as to why, but it has to do with the number of volumes that can be mounted per node(?) when using GP2.


colstrom commented Aug 5, 2020

There are two different issues at play here, and I think that's where the confusion comes from.

The first issue is with diego-cell: if the PVC it requests is backed by EFS, the "available disk space" is massive, as in petabytes massive. This causes the pre-allocation step to fail, and the workaround @jandubois described above is how we solved it. I agree that it feels very hacky. We should be able to detect this scenario and handle it gracefully, either by failing with an error that provides reasonable guidance or (better, IMO) by logging a warning and handling the problem without requiring user intervention.

There are (at least) two hints we can use to detect this scenario. EBS Volumes have a maximum size of 16 TiB, and EFS presents as petabytes, so if the volume size is above some threshold, that suggests we may be in this scenario. The other hint is that EFS mounts as type nfs4. I don't think that's a reliable enough indicator on its own, but together those may be sufficient.
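
As an illustration only, a rough shell sketch of that heuristic; the mount point is a placeholder, not something specified in this issue:

MOUNT=/var/vcap/data   # placeholder mount point for the cell's data volume
FSTYPE=$(df --output=fstype "$MOUNT" | tail -n 1 | tr -d ' ')
SIZE_BYTES=$(df -B1 --output=size "$MOUNT" | tail -n 1 | tr -d ' ')
EBS_MAX_BYTES=$((16 * 1024 * 1024 * 1024 * 1024))   # 16 TiB, the EBS volume size limit
if [ "$FSTYPE" = "nfs4" ] || [ "$SIZE_BYTES" -gt "$EBS_MAX_BYTES" ]; then
  echo "Volume at $MOUNT looks like EFS; skip full-size grootfs pre-allocation."
fi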

The other issue is that EC2 Instances have a hard upper bound on the number of EBS Volumes that can be attached to a given instance. The limit varies by instance type, depending on the "instance storage" included with the instance, but on most of the current generation the cap is roughly 25 volumes.

Things get a bit dodgy when you try to attach another volume beyond that. The EBS Volume provisions successfully and then sits in an Attaching state until something else is detached. This gets super fun with EKS, because if an EBS Volume remains in an Attaching state for more than 30 minutes, the node gets a NoSchedule taint of NodeWithImpairedVolumes=true, which prevents other workloads from scheduling there, even if they have nothing to do with EBS Volumes. And since the attach never actually fails, the affected job won't reschedule elsewhere.
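
For anyone debugging this, one way to check whether any node has already picked up that taint might be something like:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}' | grep NodeWithImpairedVolumes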

diego-cell is only going to provision one of these, so that's unlikely to cause a problem on its own. However, minibroker (by default) allocates an 8GB PVC for each service, and sometimes more for HA services. If those PVCs are backed by EBS, you can get a node into a problem state with as few as 8 HA services (1 master and 2 slaves means 3 PVCs per service, so 8 × 3 = 24 volumes, right at the ~25 cap), at which point no new pods can land on that node. But until that taint is applied, Kube is happy to keep allocating pods to the node, including minibroker pods. So there's roughly a 30 minute window during which workloads can be placed into a state that will never resolve without intervention.

As far as detecting goes, the limit for a given node can be found with:

kubectl get node NODE_NAME -ojsonpath='{.status.allocatable.attachable-volumes-aws-ebs}'

And the current number of attached volumes can be found with:

kubectl get node NODE_NAME -ojson | jq -r '.status.volumesAttached[].name' | awk -F : 'BEGIN { volumes = 0 } $1 == "kubernetes.io/aws-ebs/aws" { volumes++ } END { print volumes }'

Using this information (likely obtained another way), we may be able to detect when this scenario is present, or likely to occur, and either advise the user accordingly, or bias the scheduling somehow to minimize the issue.
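
Just to illustrate, the two commands above could be combined into a rough capacity check; NODE_NAME and the warning threshold are placeholders:

NODE_NAME=my-node   # placeholder
LIMIT=$(kubectl get node "$NODE_NAME" -o jsonpath='{.status.allocatable.attachable-volumes-aws-ebs}')
ATTACHED=$(kubectl get node "$NODE_NAME" -o json | jq -r '[.status.volumesAttached[]?.name | select(startswith("kubernetes.io/aws-ebs/"))] | length')
if [ "$ATTACHED" -ge $((LIMIT - 2)) ]; then
  echo "Node $NODE_NAME has $ATTACHED of $LIMIT attachable EBS volumes in use."
fi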

If minibroker is using EFS for PVCs, then this problem goes away entirely.
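
For reference, switching the cluster default (so minibroker's PVCs land on EFS without per-service configuration) is typically done by flipping the is-default-class annotation; efs-sc here is a placeholder for whatever the EFS-backed class ends up being called:

kubectl patch storageclass gp2 -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
kubectl patch storageclass efs-sc -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'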

If there's anything I've missed here, please let me know, and I'd be happy to fill in any gaps.
