Change default EKS storageclass to use EFS, not GP2 #918

Open

gaktive opened this issue Aug 4, 2020 · 3 comments

gaktive commented Aug 4, 2020

Based on some recent observations of EKS deployments with CAP 2, we need to encourage admins to deploy with EFS as the backing storage. We've mentioned GP2 in the documentation before, but there is a limit to how many GP2-backed PVCs can be attached to a single node, and GP2 volumes cannot span multiple availability zones the way EFS does. We may, however, need to be mindful that EFS has its own limits on write operations per second.

@troytop can provide more details.
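
For reference, a minimal sketch of what an EFS-backed StorageClass could look like, assuming the aws-efs-csi-driver (with dynamic provisioning support) is installed on the cluster; the class name and file system ID are placeholders, not values from this issue:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc                    # placeholder name
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-12345678       # placeholder EFS file system ID
  directoryPerms: "700"
EOF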

jandubois (Member) commented:

Isn't there some issue with diego-cell and EFS? Garden tries to allocate all available disk space, minus some reserved amount (something like 15GB), to grootfs, and then runs into a problem allocating a sparse file of that size.

I have some notes from Andrew/Vlad saying you can work around this with something like the following:

properties:
  diego-cell:
    rep:
      diego:
        executor:
          disk_capacity_mb: 40960  # cap the cell's advertised disk capacity at 40 GiB
    garden:
      grootfs:
        # reserve (nearly) everything else so grootfs doesn't size its store
        # from the petabyte-scale capacity reported by EFS
        reserved_space_for_other_jobs_in_mb: 8796051063807

But that feels more like a hack than a solution, so I don't think this should go into the docs as the recommended configuration.


troytop commented Aug 5, 2020

IIUC, this workaround is currently required when using EFS, but using GP2 will not scale. We need details from @colstrom as to why, but it has to do with the number of volumes that can be mounted per node(?) when using GP2.


colstrom commented Aug 5, 2020

There are two different issues at play here, and I think that's where the confusion comes from.

The first issue is with diego-cell: if the PVC it requests is backed by EFS, the "available disk space" is massive, as in petabytes massive. This causes the pre-allocation step to fail, and the workaround @jandubois described above is how we solved it. I agree that it feels very hacky. We should be able to detect this scenario and handle it gracefully, either by failing with an error that provides reasonable guidance or (better, IMO) by logging a warning and handling the problem without requiring user intervention.

There are (at least) two hints we can use to detect this scenario. EBS Volumes have a maximum size of 16 TiB, and EFS presents as petabytes, so if the volume size is above some threshold, that suggests we may be in this scenario. The other hint is that EFS mounts as type nfs4. I don't think that's a reliable enough indicator on its own, but together those may be sufficient.
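
As an illustration only, a rough shell sketch of that heuristic; the mount point is a placeholder, not something specified in this issue:

MOUNT=/var/vcap/data   # placeholder mount point for the cell's data volume
FSTYPE=$(df --output=fstype "$MOUNT" | tail -n 1 | tr -d ' ')
SIZE_BYTES=$(df -B1 --output=size "$MOUNT" | tail -n 1 | tr -d ' ')
EBS_MAX_BYTES=$((16 * 1024 * 1024 * 1024 * 1024))   # 16 TiB, the EBS volume size limit
if [ "$FSTYPE" = "nfs4" ] || [ "$SIZE_BYTES" -gt "$EBS_MAX_BYTES" ]; then
  echo "Volume at $MOUNT looks like EFS; skip full-size grootfs pre-allocation."
fi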

The other issue is that EC2 Instances have a hard upper bound on the number of EBS Volumes that can be attached to a given instance. The limit varies by instance type, depending on the "instance storage" included with the instance, but on most of the current generation the cap is roughly 25 volumes.

Things get a bit dodgy when you try to attach another volume beyond that. The EBS Volume provisions successfully and then sits in an Attaching state until something else is detached. This gets super fun with EKS, because if an EBS Volume remains in an Attaching state for more than 30 minutes, the node gets a NoSchedule taint of NodeWithImpairedVolumes=true, which prevents other workloads from scheduling there, even if they have nothing to do with EBS Volumes. And since the attach never actually fails, the affected job won't reschedule elsewhere.
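
For anyone debugging this, one way to check whether any node has already picked up that taint might be something like:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}' | grep NodeWithImpairedVolumes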

diego-cell is only going to provision one of these, so that's unlikely to cause a problem on its own. However, minibroker (by default) allocates an 8GB PVC for each service, and sometimes more for HA services. If those PVCs are backed by EBS, you can get a node into a problem state with as few as 8 HA services (1 master and 2 slaves means 3 PVCs per service, so 8 × 3 = 24 volumes, right at the ~25 cap), at which point no new pods can land on that node. But until that taint is applied, Kube is happy to keep allocating pods to the node, including minibroker pods. So there's roughly a 30 minute window during which workloads can be placed into a state that will never resolve without intervention.

As far as detecting goes, the limit for a given node can be found with:

kubectl get node NODE_NAME -ojsonpath='{.status.allocatable.attachable-volumes-aws-ebs}'

And the current number of attached volumes can be found with:

kubectl get node NODE_NAME -ojson | jq -r '.status.volumesAttached[].name' | awk -F : 'BEGIN { volumes = 0 } $1 == "kubernetes.io/aws-ebs/aws" { volumes++ } END { print volumes }'

Using this information (likely obtained another way), we may be able to detect when this scenario is present, or likely to occur, and either advise the user accordingly, or bias the scheduling somehow to minimize the issue.
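
Just to illustrate, the two commands above could be combined into a rough capacity check; NODE_NAME and the warning threshold are placeholders:

NODE_NAME=my-node   # placeholder
LIMIT=$(kubectl get node "$NODE_NAME" -o jsonpath='{.status.allocatable.attachable-volumes-aws-ebs}')
ATTACHED=$(kubectl get node "$NODE_NAME" -o json | jq -r '[.status.volumesAttached[]?.name | select(startswith("kubernetes.io/aws-ebs/"))] | length')
if [ "$ATTACHED" -ge $((LIMIT - 2)) ]; then
  echo "Node $NODE_NAME has $ATTACHED of $LIMIT attachable EBS volumes in use."
fi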

If minibroker is using EFS for PVCs, then this problem goes away entirely.
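
For reference, switching the cluster default (so minibroker's PVCs land on EFS without per-service configuration) is typically done by flipping the is-default-class annotation; efs-sc here is a placeholder for whatever the EFS-backed class ends up being called:

kubectl patch storageclass gp2 -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
kubectl patch storageclass efs-sc -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'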

If there's anything I've missed here, please let me know, and I'd be happy to fill in any gaps.
