Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

terraform: add AWS/EKS deployment for ChatQnA #480

Merged
merged 2 commits into from
Oct 24, 2024

Conversation

poussa
Copy link
Collaborator

@poussa poussa commented Oct 16, 2024

Description

Add AWS/EKS terraform deployment for OPEA applications. This PR add support for ChatQnA.

Followup PRs add more cloud providers and OPEA applications.

Issues

opea-project/GenAIExamples#427
opea-project/GenAIExamples#551
opea-project/GenAIExamples#550
opea-project/GenAIExamples#549
opea-project/GenAIExamples#510

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)

Dependencies

None

Tests

None

@poussa poussa marked this pull request as draft October 16, 2024 17:53
@poussa poussa force-pushed the terraform-aws-eks branch 3 times, most recently from b94f5d2 to dcd3cf9 Compare October 21, 2024 14:54
@poussa poussa marked this pull request as ready for review October 21, 2024 14:54
@poussa
Copy link
Collaborator Author

poussa commented Oct 21, 2024

@mkbhanda @daisy-ycguo PTAL.

This has now been tested on AWS EKS.

@lianhao
Copy link
Collaborator

lianhao commented Oct 23, 2024

@poussa Could you pls fix this DCO issue by sing-off your commit? i.e. git commit -s

@poussa
Copy link
Collaborator Author

poussa commented Oct 23, 2024

@poussa Could you pls fix this DCO issue by sing-off your commit? i.e. git commit -s

This is now fixed.

@lianhao @mkbhanda

@poussa poussa force-pushed the terraform-aws-eks branch 2 times, most recently from b9576de to 251b461 Compare October 23, 2024 12:22
@poussa poussa force-pushed the terraform-aws-eks branch from 8b282e9 to ba5a0ef Compare October 24, 2024 07:11
Copy link
Collaborator

@mkbhanda mkbhanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor commends and a few questions. Thank you!

cloud-service-provider/aws/eks/README.MD Show resolved Hide resolved
cloud-service-provider/aws/eks/README.MD Show resolved Hide resolved
cloud-service-provider/aws/eks/README.MD Outdated Show resolved Hide resolved
cloud-service-provider/aws/eks/README.MD Outdated Show resolved Hide resolved
cloud-service-provider/aws/eks/opea-chatqna.tfvars Outdated Show resolved Hide resolved
cloud-service-provider/aws/eks/variables.tf Outdated Show resolved Hide resolved
cloud-service-provider/aws/eks/variables.tf Show resolved Hide resolved
cloud-service-provider/aws/eks/variables.tf Show resolved Hide resolved
@poussa poussa force-pushed the terraform-aws-eks branch 2 times, most recently from 3695da5 to ede5438 Compare October 24, 2024 13:32
@poussa poussa force-pushed the terraform-aws-eks branch from 1e45f2b to 6554e0d Compare October 24, 2024 13:33
@poussa
Copy link
Collaborator Author

poussa commented Oct 24, 2024

@mkbhanda thanks for the review. I fixed most of your suggestions and commented the rest.

@lucasmelogithub
Copy link

lucasmelogithub commented Oct 24, 2024

@mkbhanda

Thoughts on a folder structure that is more scalable? I also have Terraform examples and Ansible code to commit from opea-project/GenAIExamples#970

Suggestion: GenAIExamples/ChatQnA/terraform/intel/cpu/xeon/aws/eks/main.tf
This follows the docker_compose already defined structure.

We'll also have Ansible, so it would scale to GenAIExamples/ChatQnA/ansible/intel/cpu/xeon/ubuntu/single-node/playbook.yaml

@poussa
Copy link
Collaborator Author

poussa commented Oct 24, 2024

My idea of the terraform structure is:

cloud-service-provider
|--aws
    |--aks
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--fargate
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant AWS services]
|--gcp
    |--gke
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--vertex-ai
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant GCP services]
|--azure
    |--aks
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant Azure services]

This way we don't need to expose app name (e.g. chatqna), vendor name (e.g. intel), product names (e.g., xeon) or accelerator type (e.g. gaudi) in the directory structure. They are hidden in .tfvars config files or terraform command line arguments (-var foo=bar). There can be multiple .tfvars files per directory to capture multiple apps, accelerators, and cpu configurations.

NOTE: the top level directory can be in any repo but that is a separate topic.

@lucasmelogithub
Copy link

My idea of the terraform structure is:

cloud-service-provider
|--aws
    |--aks
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--fargate
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant AWS services]
|--gcp
    |--gke
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--vertex-ai
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant GCP services]
|--azure
    |--aks
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant Azure services]

This way we don't need to expose app name (e.g. chatqna), vendor name (e.g. intel), product names (e.g., xeon) or accelerator type (e.g. gaudi) in the directory structure. They are hidden in .tfvars config files or terraform command line arguments (-var foo=bar). There can be multiple .tfvars files per directory to capture multiple apps, accelerators, and cpu configurations.

NOTE: the top level directory can be in any repo but that is a separate topic.

Ok, I think we are looking to solve two different problems. I think you are going for generic infrastructure for any use case.
I would at least suggest a more flexible pattern, use GenAIInfra/scripts/terraform/<aws|azure|gcp>

variable "cluster_name" {
description = "EKS cluster name"
type = string
default = null

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A suggestion, remove the default so that an input is required and it prompts the user if missing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the region also not have a default. Alternately a default region establishes that we can find the instance type mentioned in that region.

@poussa
Copy link
Collaborator Author

poussa commented Oct 24, 2024

Ok, I think we are looking to solve two different problems. I think you are going for generic infrastructure for any use case. I would at least suggest a more flexible pattern, use GenAIInfra/scripts/terraform/<aws|azure|gcp>

terraform we should add, since there will be other deployment methods like aws/cloudformation, maybe ansible. So that should have a place in the path structure.

@poussa
Copy link
Collaborator Author

poussa commented Oct 24, 2024

Of course, this is inspired by https://github.com/NVIDIA/nim-deploy/tree/main/cloud-service-providers

@lucasmelogithub
Copy link

lucasmelogithub commented Oct 24, 2024

Ok, I think we are looking to solve two different problems. I think you are going for generic infrastructure for any use case. I would at least suggest a more flexible pattern, use GenAIInfra/scripts/terraform/<aws|azure|gcp>

terraform we should add, since there will be other deployment methods like aws/cloudformation, maybe ansible. So that should have a place in the path structure.

Right, that was the idea. And GenAIInfra/scripts... supports Ansible, Bash, etc which should not be under /cloud-service-providers/...

@mkbhanda
Copy link
Collaborator

My idea of the terraform structure is:

cloud-service-provider
|--aws
    |--aks
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--fargate
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant AWS services]
|--gcp
    |--gke
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--vertex-ai
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant GCP services]
|--azure
    |--aks
        |---[terraform files] and <opea-app.tfvars> # instance type etc. defined in .tfvars
    |--- [other relevant Azure services]

This way we don't need to expose app name (e.g. chatqna), vendor name (e.g. intel), product names (e.g., xeon) or accelerator type (e.g. gaudi) in the directory structure. They are hidden in .tfvars config files or terraform command line arguments (-var foo=bar). There can be multiple .tfvars files per directory to capture multiple apps, accelerators, and cpu configurations.

NOTE: the top level directory can be in any repo but that is a separate topic.

I am in favor of this organization for the same reasons that @poussa cites and also agree keeping terraform as part of the path to differentiate between cloud formation etc. Thank you @lucasmelogithub for reviewing and providing comments. I am OK with a default name of OPEA for the cluster, keep it easy. For a more savvy user with multiple clusters, they can change the vars file.

Copy link
Collaborator

@mkbhanda mkbhanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -0,0 +1,4 @@
cluster_name = "opea-chatqna"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just call it opea?

cloud-service-provider/aws/eks/opea-chatqna.tfvars Outdated Show resolved Hide resolved
cloud-service-provider/aws/eks/README.MD Show resolved Hide resolved
variable "cluster_name" {
description = "EKS cluster name"
type = string
default = null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the region also not have a default. Alternately a default region establishes that we can find the instance type mentioned in that region.

@mkbhanda mkbhanda merged commit bdb9af9 into opea-project:main Oct 24, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants