ECS GenAI Inference: vLLM on AWS Inferentia with Neuron #250

6,771 changes: 3,385 additions & 3,386 deletions cdk/examples/generative_ai_rag/assets/reinvent.txt


34 changes: 17 additions & 17 deletions cdk/examples/generative_ai_rag/web-app/pages/rag_integration.py
@@ -61,7 +61,7 @@ def get_prompt_template(retrieved_passages: List[str]) -> str:
return f"""
You are an AI assistant answering questions about AWS re:Invent 2024 session information and general queries.
Your task is to analyze the user's question, categorize it, and provide an appropriate response.
Every session information includes the following fields:
- Title
- Session Code
- Description
@@ -83,12 +83,12 @@ def get_prompt_template(retrieved_passages: List[str]) -> str:
2. REINVENT_INFORMATION
- This question type is for questions requesting information about specific sessions
- This question type is for questions requesting information about sessions held at certain venue and times

3. REINVENT_RECOMMENDATION
- This question type is for session recommendation questions for specific topics or interests

Second, analyze the user's question and provide a response based on the question type.

1. GENERAL
- Ignore the content in the retrieved passages.
- Provide a direct answer to the question based on your general knowledge.
@@ -117,12 +117,12 @@ def get_prompt_template(retrieved_passages: List[str]) -> str:

IMPORTANT:
- Always base your answers on the provided data and refrain from offering uncertain information.
- Your final response should only contain the actual answer to the user's question.
- Do not include any explanation of your thought process, categorization, or analysis in the final response.
- If retrieved passages are empty and question type is not GENERAL, respond with "Sorry. I couldn't find any related information."
- Do not modify fields data in the retrieved passages.
- If all conditions are not met, recommend similar sessions and be sure to explain the reason.

CRITICAL RESPONSE FORMAT:
- You MUST format your entire response EXACTLY as follows, with no exceptions:

@@ -147,7 +147,7 @@ def get_prompt_template(retrieved_passages: List[str]) -> str:
[/QUESTION_TYPE]
[RESPONSE]
Based on your question, I recommend the following session:

1. Responsible generative AI tabletop: Governance and oversight [REPEAT]
- Session Code: GHJ208-R1
- Session Type: Gamified learning
@@ -195,14 +195,14 @@ def main():
if not knowledge_base_id:
st.info("Something is wrong with parameter store")
st.stop()

agent_client = boto3.client('bedrock-agent-runtime')
bedrock_runtime_client = boto3.client('bedrock-runtime')

try:
# Retrieve relevant passages from the knowledge base
retrieved_results = retrieve_from_knowledge_base(agent_client, knowledge_base_id, prompt)

# Extract and format the retrieved passages
retrieved_passages = [result['content']['text'] for result in retrieved_results]
formatted_passages = "\n\n".join(f"Passage {i+1}:\n{passage}" for i, passage in enumerate(retrieved_passages))
@@ -213,7 +213,7 @@ def main():
response_started = False

message_placeholder = st.chat_message("assistant").empty()

# Generate the final response using the invoke_model API
system_prompt = get_prompt_template(formatted_passages)

@@ -230,7 +230,7 @@ def main():
elif response_started:
# If we're past the [RESPONSE] tag, continue accumulating the response content
response_content += chunk

# Display the response content if we've started collecting it
if response_started:
# Remove the [/RESPONSE] tag if present and display the content
@@ -240,14 +240,14 @@ def main():
import re
question_type_match = re.search(r'\[QUESTION_TYPE\](.*?)\[/QUESTION_TYPE\]', full_response, re.DOTALL)
response_match = re.search(r'\[RESPONSE\](.*?)\[/RESPONSE\]', full_response, re.DOTALL)

question_type = question_type_match.group(1).strip() if question_type_match else "UNKNOWN"
final_response = response_match.group(1).strip() if response_match else "I apologize. There was an issue generating an appropriate response."

message_placeholder.markdown(final_response)

st.session_state.messages.append({"role": "assistant", "content": final_response})

# Display citations only for non-general questions
if question_type not in ["GENERAL", "UNKNOWN"]:
with st.expander("Data Sources"):
@@ -264,10 +264,10 @@ def main():
st.warning("Session has expired. Starting a new session. Please enter your question again.")
else:
st.error("An error occurred while processing the response. Please check the logs for details.")

msg = "I encountered an issue while processing the response. Could you please rephrase your prompt or try a different question?"
st.session_state.messages.append({"role": "assistant", "content": msg})
st.chat_message("assistant").write(msg)

if __name__ == "__main__":
main()
2 changes: 1 addition & 1 deletion cdk/examples/other_stack/bedrock_stack.py
@@ -46,7 +46,7 @@ def __init__(
self.knowledge_base_data_source = bedrock.S3DataSource(self, 'KnowledgeBaseDataSource',
bucket=self.bucket,
knowledge_base=self.knowledge_base,
data_source_name='ReinventSessionInformationText',
chunking_strategy= bedrock.ChunkingStrategy.hierarchical(
overlap_tokens=60,
max_parent_token_size=1500,
Binary file added docs/vllm-inferentia-architecture.png
19 changes: 19 additions & 0 deletions terraform/ec2-examples/vllm-inferentia/Dockerfile
@@ -0,0 +1,19 @@
# default base image
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
# Clone the vllm repository
RUN git clone https://github.com/vllm-project/vllm.git
# Set the working directory
WORKDIR /vllm
RUN git checkout v0.6.0
# Set the environment variable
ENV VLLM_TARGET_DEVICE=neuron
# Install the dependencies
RUN python3 -m pip install -U -r requirements-neuron.txt
RUN python3 -m pip install .
# Modify the arg_utils.py file to support larger block_size option
RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py
# Install ray
RUN python3 -m pip install ray
RUN pip install -U "triton>=3.0.0"
# Set the entry point
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
250 changes: 250 additions & 0 deletions terraform/ec2-examples/vllm-inferentia/README.md
@@ -0,0 +1,250 @@
# ECS GenAI inference: vLLM on AWS Inferentia with Neuron

This solution blueprint creates the infrastructure needed to run GenAI inference using [vLLM](https://docs.vllm.ai/en/latest/index.html) with [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) on Inferentia 2 instances. It is based on similar examples that run vLLM inference on Inferentia-based instances with [EKS](https://aws.amazon.com/blogs/machine-learning/deploy-meta-llama-3-1-8b-on-aws-inferentia-using-amazon-eks-and-vllm/) and [EC2](https://aws.amazon.com/blogs/machine-learning/serving-llms-using-vllm-and-amazon-ec2-instances-with-aws-ai-chips/).

![Architecture diagram](../../../docs/vllm-inferentia-architecture.png)

By default, this blueprint deploys inf2.8xlarge instances optimized for GenAI inference workloads. The setup is tailored for running vLLM with pre-compiled Neuron-compatible models. You can modify the instance type and resource allocation by changing the variables in the Terraform configuration.

## Components

* ECS Cluster:
* Uses an Auto Scaling group to provision inf2 instances for the ECS cluster.
* Allows dynamic scaling of GenAI workloads.
* ECS Service Definition:
* vLLM Service: Configured to serve requests for GenAI inference using vLLM.
* Application Load Balancer:
* Exposes the vLLM inference service endpoint to clients.
* Configured with a target group and health checks for monitoring service availability.
* CloudWatch Logs:
* Logs from ECS tasks and services are collected in CloudWatch for monitoring and debugging.
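
The blueprint wires these pieces together in its Terraform configuration. As a rough, illustrative sketch only (resource names, sizes, device paths, and serving arguments below are assumptions rather than the blueprint's exact code), an ECS task definition for vLLM on Inferentia typically maps the Neuron device into the container and passes serving arguments to the OpenAI-compatible server:

```hcl
# Illustrative only -- see main.tf in this blueprint for the actual resources.
resource "aws_ecs_task_definition" "vllm" {
  family                   = "vllm-inferentia"
  requires_compatibilities = ["EC2"]
  network_mode             = "awsvpc"
  cpu                      = "16384"
  memory                   = "60000"

  container_definitions = jsonencode([
    {
      name         = "vllm"
      image        = "<ECR IMAGE URI>"
      essential    = true
      portMappings = [{ containerPort = 8000, protocol = "tcp" }]
      # Arguments appended to the image ENTRYPOINT (the vLLM OpenAI API server).
      command = ["--model", "meta-llama/Llama-3.2-1B", "--device", "neuron", "--tensor-parallel-size", "2"]
      # Hugging Face token so vLLM can download the gated model at startup.
      environment = [{ name = "HF_TOKEN", value = "<YOUR HUGGING FACE API KEY>" }]
      # Expose the Inferentia (Neuron) device to the container.
      linuxParameters = {
        devices = [{
          hostPath      = "/dev/neuron0"
          containerPath = "/dev/neuron0"
          permissions   = ["read", "write"]
        }]
      }
    }
  ])
}
```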


## Prerequisites

### Hugging Face Account and API Key

To use the meta-llama/Llama-3.2-1B model within this blueprint, you’ll need a Hugging Face account and an API key with access to the model. Follow these steps to set them up:

1. [Sign up for a Hugging Face account](https://huggingface.co/join) if you don’t already have one.
2. Go to the [meta-llama/Llama-3.2-1B model card](https://huggingface.co/meta-llama/Llama-3.2-1B) on Hugging Face.
3. Agree to the model license to gain access.
4. Generate your Hugging Face API key:
* Navigate to your [Hugging Face Account Settings](https://huggingface.co/settings/tokens).
* Under the Access Tokens section, click New Token.
* Provide a name for your token and set the role to read or write.
* Copy the token when prompted; it will not be displayed again.

## Preparing the Docker Image

To run the model, you’ll need to build a Docker image with the required dependencies and push it to Amazon Elastic Container Registry (Amazon ECR). While [you can use docker buildx](https://docs.docker.com/build/building/multi-platform/) to build it on your local machine, if your machine isn’t configured for multi-platform builds you can use an Inf2-based EC2 instance as a build environment so the image matches the x86_64 architecture of the Inf2 hosts.
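
If your workstation already has buildx configured, a minimal local-build sketch (the platform flag and the image tag are assumptions that match the rest of this guide; the ECR tag and push steps below still apply) could be:

```bash
# Build for the x86_64 (linux/amd64) platform used by Inf2 hosts
docker buildx build --platform linux/amd64 -t vllm-neuron:latest .
```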

### Steps to launch an Inf2-based Build Environment:

#### 1. Launch an Inf2-based EC2 Instance
1. Open the AWS Management Console and launch an Inf2-based EC2 instance (e.g., inf2.8xlarge). You can use a [guide like this](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-launch-instance-wizard.html). If this is your first time using inf/trn instances, you will need to [request a quota increase](https://repost.aws/articles/ARgmEMvbR6Re200FQs8rTduA/inferentia-and-trainium-service-quotas).
2. Ensure the instance:
* Has access to your [Amazon ECR repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-push-iam.html).
* Has permissions for Docker and AWS CLI operations.
* Can be accessed via Session Manager or is [configured for SSH access](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connection-prereqs-general.html).
3. Access the instance through Session Manager, or SSH into it using the following command:

```bash
ssh -i your-key.pem ec2-user@<ec2-public-ip>
```

#### 2. Set Up Environment Variables

```bash
export ECR_REPO_NAME=vllm-neuron
export AWS_REGION=us-west-2
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
```

#### 3. Create an ECR Repository
Run the following command to create an ECR repository:

```bash
aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION
```

#### 4. Create the Dockerfile

> If you're using your local development machine, you can skip this step as a Dockerfile already exists in this project.

Create the Dockerfile for the vLLM server:
```bash
cat > Dockerfile <<EOF
# default base image
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
# Clone the vllm repository
RUN git clone https://github.com/vllm-project/vllm.git
# Set the working directory
WORKDIR /vllm
RUN git checkout v0.6.0
# Set the environment variable
ENV VLLM_TARGET_DEVICE=neuron
# Install the dependencies
RUN python3 -m pip install -U -r requirements-neuron.txt
RUN python3 -m pip install .
# Modify the arg_utils.py file to support larger block_size option
RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py
# Install ray
RUN python3 -m pip install ray
RUN pip install -U "triton>=3.0.0"
# Set the entry point
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
EOF
```

#### 5. Build and Push the Docker Image

Run the following commands to build and push the Docker image:

1. Authenticate Docker to your ECR registry:

```bash
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
```

2. Build the Docker image:

```bash
docker build -t ${ECR_REPO_NAME}:latest .
```

3. Tag the image

```bash
docker tag ${ECR_REPO_NAME}:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
```

4. Push the image to ECR

```bash
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
```

5. Copy the ECR image URI for use in the `main.tf` file within this project:

```bash
echo "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest"
```

## Deployment Prerequisites

1. Modify the local variables at line 6 of `main.tf`:
```hcl
name = "ecs-demo-vllm-inferentia" # Default name of the project
region = "us-west-2" # Default region
instance_type = "inf2.8xlarge" # Default instance size
vllm_container_image = "<ECR IMAGE URI>" # ECR image URI you created when building and pushing your image
hugging_face_api_key = "<YOUR HUGGING FACE API KEY>" # Your Hugging Face API key
```


## Deployment

1. Deploy core-infra resources

```shell
cd ./terraform/ec2-examples/core-infra
terraform init
terraform apply -target=module.vpc -target=aws_service_discovery_private_dns_namespace.this
```

2. Deploy this blueprint

```shell
cd ../vllm-inferentia
terraform init
terraform apply
```

## Example: Running GenAI Inference

Once the cluster and services are deployed, you can use the load balancer DNS name (output during the deployment) to send requests to the vLLM service.
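
You can list the Terraform outputs for this blueprint (including the load balancer DNS name) at any time with:

```shell
# Run from the vllm-inferentia blueprint directory
terraform output
```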


Send a POST request to the vLLM OpenAI-compatible endpoint:
```bash
curl -X POST http://<ALB_DNS_NAME>:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B",
"prompt": "Write a short poem about technology",
"max_tokens": 100,
"temperature": 0.7
}'
```

Example Response:
```json
{
"id": "cmpl-6ze...",
"object": "text_completion",
"created": 1680307267,
"model": "meta-llama/Llama-3.2-1B",
"choices": [
{
"text": "\n\nTechnology, a wondrous art,\nA force that shapes the world's heart.\nIn circuits small and data vast,\nIt links the future to the past.",
"index": 0,
"logprobs": null,
"finish_reason": "stop"
}
]
}
```
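
The same endpoint can also be called from Python. Below is a minimal sketch using the `openai` client library (assumes `pip install openai`; replace `<ALB_DNS_NAME>` as above):

```python
from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible endpoint behind the ALB.
client = OpenAI(
    base_url="http://<ALB_DNS_NAME>:8000/v1",
    api_key="not-needed",  # vLLM does not validate the key by default
)

completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="Write a short poem about technology",
    max_tokens=100,
    temperature=0.7,
)
print(completion.choices[0].text)
```
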
## What do you do next?

Congratulations on successfully deploying your vLLM inference solution on ECS with AWS Inferentia! Here are some ideas to take your implementation to the next level:

1. Explore Frontend Integrations
* Gradio: Use [Gradio](https://www.gradio.app/) to create an interactive web interface for your model, allowing users to test various prompts and visualize responses (a minimal sketch follows this list).
* OpenWebUI: Deploy [OpenWebUI](https://github.com/open-webui/open-webui) as an additional frontend for your model, providing a user-friendly way to interact with the OpenAI-compatible endpoint.

2. Build Custom Python Applications
* Create Python scripts or applications that integrate your inference service to solve real-world problems:
* Automate customer support chatbots.
* Generate summaries, translations, or other natural language tasks.
* Build a personalized AI assistant tailored to your organizational needs.

3. Integrate with Existing Workflows
* Serverless Architectures: Use AWS Lambda or Step Functions to trigger and process model inference requests in response to specific events.
* Data Pipelines: Integrate the model into your data pipelines for real-time predictions or insights, such as tagging or categorizing documents automatically.
* CRM and ERP Systems: Embed the model into your enterprise systems to provide intelligent insights or streamline processes.

4. Optimize for Performance
* Experiment with different batch sizes and parallelization settings in vLLM to handle more concurrent requests or improve latency.
* Use Neuron monitoring tools to analyze and fine-tune the utilization of Inferentia chips for maximum efficiency.

5. Scale and Extend
* Add multi-model support by deploying multiple versions of your model (e.g., fine-tuned for specific tasks) and routing traffic dynamically using the ALB.
* Experiment with autoscaling policies to dynamically adjust the number of running tasks based on request volume.

6. Learn from Amazon’s Approach
* Discover how Amazon’s engineering team scaled generative AI for Amazon Rufus, powering conversational shopping experiences during Prime Day.
* Adapt lessons learned from their implementation to improve scalability, reliability, and cost-efficiency in your use case.
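
As mentioned in item 1 above, a minimal Gradio front end for the endpoint (illustrative only; assumes `pip install gradio openai` and the same `<ALB_DNS_NAME>` placeholder) might look like:

```python
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://<ALB_DNS_NAME>:8000/v1", api_key="not-needed")

def generate(prompt: str) -> str:
    # Forward the prompt to the vLLM completions endpoint and return the generated text.
    result = client.completions.create(
        model="meta-llama/Llama-3.2-1B",
        prompt=prompt,
        max_tokens=100,
        temperature=0.7,
    )
    return result.choices[0].text

# Simple text-in/text-out interface for trying prompts interactively.
gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```
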
## Clean up

1. Destroy this blueprint

```shell
terraform destroy
```

2. Destroy core-infra resources

```shell
cd ../core-infra
terraform destroy

```

## Troubleshooting



## Support

Please open an issue for questions or unexpected behavior.