gpu-dashboard

This repository contains tools for tracking GPU usage and generating dashboards.

Main Features

Collect GPU usage data from multiple companies and projects
Generate daily, weekly, monthly, and all-time GPU usage reports
Update dashboards using Weights & Biases (wandb)
Detect and alert on abnormal GPU usage rates

Architecture

Directory Structure of this Repository

.
├── Dockerfile.check_dashboard
├── Dockerfile.main
├── README.md
├── config.yaml
├── main.py
├── requirements.txt
├── src
│   ├── alart
│   │   └── check_dashboard.py
│   ├── calculator
│   │   ├── blank_table.py
│   │   ├── gpu_usage_calculator.py
│   │   └── remove_tags.py
│   ├── tracker
│   │   ├── common.py
│   │   ├── config_parser.py
│   │   ├── run_manager.py
│   │   └── set_gpucount.py
│   ├── uploader
│   │   ├── artifact_handler.py
│   │   ├── data_processor.py
│   │   └── run_uploader.py
│   └── utils
│       └── config.py
└── image
    └── gpu-dashboard.drawio.png

Local Environment Setup

In the gpu-dashboard directory, run the following commands:

$ python3 -m venv .venv
$ . .venv/bin/activate
$ pip install -r requirements.txt

AWS Environment Setup

Account Creation and Permission Assignment

Request an AWS account from the administrator, and have access permissions for the following services assigned in IAM:

AWSBatch
CloudWatch
EC2
ECS
ECR
EventBridge
IAM
VPC

AWS CLI Configuration

Create a user for AWS CLI in IAM. Assign access permissions to the following service:

ECR

Click on the created user and note down the following strings from the Access Keys tab:

Access key ID
Secret access key

Run the following command in your local Terminal to register these credentials with the AWS CLI:

$ aws configure

AWS Access Key ID [None]: Access key ID
# Enter
AWS Secret Access Key [None]: Secret access key
# Enter
Default region name [None]: Leave blank
# Enter
Default output format [None]: Leave blank
# Enter

After configuration, check the connection with the following command. If successful, it lists the S3 buckets in the account:

$ aws s3 ls

Reference: AWS CLI Setup Tutorial

Deploying the Scheduled Program

ECR

Creating a Repository

Navigate to Amazon ECR > Private registry > Repositories
Click Create repository
Enter a repository name (e.g., geniac-gpu)
Click Create repository

Pushing Images

Click on the created repository name
Click View push commands
Execute the four displayed commands in order in your local Terminal

# Example commands
$ aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com
$ docker build -t geniac-gpu .
$ docker tag geniac-gpu:latest 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/geniac-gpu:latest
$ docker push 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/geniac-gpu:latest

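Because these commands are specific to each repository, saving them in a shell script makes the second and later deployments easy. Note that docker build looks for a file named Dockerfile by default; this repository contains Dockerfile.main and Dockerfile.check_dashboard instead, so pass the one you need with -f (for example, docker build -f Dockerfile.main -t geniac-gpu .).

Create a repository for each of gpu-dashboard and check-dashboard by following the steps above.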
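
The four push commands can also be wrapped in a small helper so that later deployments are a single call. The sketch below does this in Python rather than shell. The function names build_push_commands and deploy are hypothetical, and which Dockerfile belongs to which ECR repository is an assumption you should adjust. The account ID, region, and repository name are the placeholder values from the example commands.

```python
import subprocess


def build_push_commands(account_id, region, repo, dockerfile):
    """Return the login/build/tag/push commands for one ECR repository."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image = f"{registry}/{repo}:latest"
    return {
        "get_password": ["aws", "ecr", "get-login-password", "--region", region],
        "login": ["docker", "login", "--username", "AWS", "--password-stdin", registry],
        # -f selects the Dockerfile explicitly (assumed mapping, adjust as needed).
        "build": ["docker", "build", "-f", dockerfile, "-t", repo, "."],
        "tag": ["docker", "tag", f"{repo}:latest", image],
        "push": ["docker", "push", image],
    }


def deploy(account_id, region, repo, dockerfile):
    """Run the commands in order, piping the ECR password into docker login."""
    cmds = build_push_commands(account_id, region, repo, dockerfile)
    password = subprocess.run(
        cmds["get_password"], check=True, capture_output=True, text=True
    ).stdout
    subprocess.run(cmds["login"], input=password, check=True, text=True)
    for step in ("build", "tag", "push"):
        subprocess.run(cmds[step], check=True)
```

Calling deploy("111122223333", "ap-northeast-1", "geniac-gpu", "Dockerfile.main") from the repository root would then run the same sequence as the example commands, stopping at the first step that fails.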

VPC

Navigate to Virtual Private Cloud > Your VPCs
Click Create VPC
Select VPC and more from Resources to create
Click Create VPC

IAM

Navigate to IAM > Roles
Click Create role
Set up the Use case:
- Select Elastic Container Service for Service
- Select Elastic Container Service Task for Use case
Select AmazonEC2ContainerRegistryReadOnly and CloudWatchLogsFullAccess for Permission policies
Click Next
Enter ecsTaskExecutionRole for Role name
Click Create role

ECS

Create Cluster

Navigate to Amazon Elastic Container Service > Clusters
Click Create Cluster
Enter a cluster name
Click Create

Task Definition

Navigate to Amazon Elastic Container Service > Task Definitions
Click Create new Task Definition, then select Create new Task Definition from the menu that opens
Enter a task definition family name
Change CPU and Memory in Task size as needed
Select ecsTaskExecutionRole for Task role
Set up Container - 1:
- Enter the repository name and image URI pushed to ECR in Container details
- Set Resource allocation limits appropriately according to Task size
Click Add environment variable in Environment variables - optional and add the following:
- Key: WANDB_API_KEY
- Value: {Your WANDB_API_KEY}
Click Create

Create Task

Navigate to Amazon Elastic Container Service > Clusters > {Cluster Name} > Scheduled Tasks
Click Create
Enter a rule name for Scheduled rule name
Select cron expression for Scheduled rule type
Enter an appropriate expression in cron expression
- Note that this UI expects UTC, so cron(15 15 * * ? *) runs at 00:15 Japan time (JST, UTC+9) on the following day
Enter a target ID for Target ID
Select the task definition from Task Definition family
Select VPC and subnets in Networking
If there's no existing security group in Security group, select Create a new security group and create one
Click Create
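
Because the schedule is entered in UTC, it can help to convert the intended Japan time before typing the cron expression. The following is a minimal sketch using only the standard library. The helper name jst_to_utc_cron is hypothetical, and it assumes a daily schedule in the six-field format shown above.

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))  # Japan Standard Time, no daylight saving


def jst_to_utc_cron(hour, minute):
    """Build a daily cron expression in UTC from a JST wall-clock time."""
    # The date is arbitrary; only the time of day matters for a daily rule.
    local = datetime(2024, 1, 2, hour, minute, tzinfo=JST)
    utc = local.astimezone(timezone.utc)
    return f"cron({utc.minute} {utc.hour} * * ? *)"


print(jst_to_utc_cron(0, 15))  # cron(15 15 * * ? *)
```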

Debugging

Local Environment Setup

Execute the following commands to set up a local Python environment for running the scheduled script. You can edit config.yaml to minimize impact on the production environment.

$ cd gpu-dashboard
$ python3 -m venv .venv
$ . .venv/bin/activate
$ pip install -r requirements.txt

Usage

Running the Main Script

python main.py [--api WANDB_API_KEY] [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD]

--api: wandb API key (optional, can be set as an environment variable)
--start-date: Data retrieval start date (optional)
--end-date: Data retrieval end date (optional)
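
For orientation, the options above could be parsed as in the following sketch. It is not taken from main.py: the function name parse_args is hypothetical, and the fallbacks are assumptions. The API key is assumed to fall back to the WANDB_API_KEY environment variable used in the ECS task definition, and both dates are assumed to default to yesterday, as described in Program Processing Steps.

```python
import argparse
import os
from datetime import date, timedelta


def parse_args(argv=None):
    """Parse the command-line options described above (illustrative only)."""
    yesterday = date.today() - timedelta(days=1)
    parser = argparse.ArgumentParser(description="Update the GPU dashboard")
    # Assumption: the API key falls back to the WANDB_API_KEY environment variable.
    parser.add_argument("--api", default=os.environ.get("WANDB_API_KEY"))
    # Assumption: both dates default to yesterday when omitted.
    parser.add_argument("--start-date", type=date.fromisoformat, default=yesterday)
    parser.add_argument("--end-date", type=date.fromisoformat, default=yesterday)
    args = parser.parse_args(argv)
    if args.start_date > args.end_date:
        parser.error("--start-date must not be later than --end-date")
    return args


args = parse_args(["--start-date", "2024-01-01", "--end-date", "2024-01-07"])
print(args.start_date, args.end_date)  # 2024-01-01 2024-01-07
```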

Checking Dashboard Health

python src/alart/check_dashboard.py

Main Components

src/tracker/: GPU usage data collection
src/calculator/: GPU usage statistics calculation
src/uploader/: Data upload to wandb
src/alart/: Anomaly detection and alert functionality

How to Check Logs

In AWS, navigate to CloudWatch > Log groups
Click on /ecs/{task definition name}
Click on the log stream to view logs

Appendix

Program Processing Steps

Fetch latest data (src/tracker/)
- Set start_date and end_date
  - If unspecified, both values default to yesterday's date
- Create a list of companies
- Fetch projects for each company [Public API]
- Fetch runs for each project [Private API]
  - Filter by target_date, tags
- Detect and alert on runs that initialize wandb multiple times on the same instance
- Fetch system metrics for each run [Public API]
- Aggregate by run id x date
Update data (src/uploader/)
- Retrieve csv up to yesterday from Artifacts
- Concatenate with the latest data and save to Artifacts
- Filter run ids
Aggregate and update data (src/calculator/)
- Remove latest tag
- Aggregate retrieved data
  - Aggregate overall data
  - Aggregate monthly data
  - Aggregate weekly data
  - Aggregate daily data
  - Aggregate summary data
- Update overall table
- Update tables for each company
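
The "Aggregate by run id x date" step can be pictured with the sketch below. The record fields (run_id, timestamp, gpu_utilization) and the choice of averaging utilization are assumptions for illustration; this README does not show the actual column names or statistics used in src/tracker/.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_by_run_and_date(records):
    """Group metric samples by (run_id, date) and average the utilization."""
    groups = defaultdict(list)
    for rec in records:
        day = datetime.fromisoformat(rec["timestamp"]).date()
        groups[(rec["run_id"], day.isoformat())].append(rec["gpu_utilization"])
    # Sort keys so the output order is stable across runs.
    return {key: mean(values) for key, values in sorted(groups.items())}


samples = [
    {"run_id": "run-a", "timestamp": "2024-01-01T10:00:00", "gpu_utilization": 80},
    {"run_id": "run-a", "timestamp": "2024-01-01T11:00:00", "gpu_utilization": 90},
    {"run_id": "run-a", "timestamp": "2024-01-02T10:00:00", "gpu_utilization": 70},
    {"run_id": "run-b", "timestamp": "2024-01-01T10:00:00", "gpu_utilization": 50},
]
for key, value in aggregate_by_run_and_date(samples).items():
    print(key, value)
```

The daily, weekly, and monthly aggregations listed above would follow the same pattern with a different grouping key.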

GPU Count Calculation Examples for Distributed Processing

In the __set_gpucount method within src/tracker/run_manager.py, GPU counts for distributed processing are calculated based on different teams and configurations. Below are the calculation methods and specific examples.

1. When num_nodes and num_gpus values are included in the config

Calculation Method

Retrieve the values of num_nodes and num_gpus, and multiply them to calculate the GPU count.
These values are obtained from the config section in the configuration file.

Specific Example

config = { "num_nodes": 2, "num_gpus": 8 }

gpu_count = 2 * 8 = 16

In this example, there are 2 nodes, each with 8 GPUs, resulting in a total GPU count of 16.

2. When world_size is included in the config

Calculation Method

Use the value of world_size to determine the GPU count.
world_size is obtained from the config section in the configuration file.

Specific Example

config = { "world_size": 16 }

gpu_count = 16

In this example, the value of world_size is directly used as the GPU count.

3. When distributed processing settings cannot be obtained from the config

Calculation Method

Use the value of node.runInfo.gpuCount.

Specific Example

node.runInfo = { "gpuCount": 8 }

gpu_count = 8

In this example, the GPU count is directly obtained from runInfo.

Notes

This method aims to calculate the GPU count as accurately as possible by accommodating various configuration formats. However, when encountering unexpected data formats, it sets the GPU count to 0 for safety and outputs a warning.
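
The three cases and the fallback described in the Notes can be summarized in the sketch below. It is not the actual __set_gpucount implementation: the function name resolve_gpu_count, the argument shapes, and the order in which the cases are checked (following the numbering above) are assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_gpu_count(config, run_info):
    """Return the GPU count for a run, or 0 with a warning if it cannot be determined."""
    try:
        # Case 1: number of nodes multiplied by GPUs per node.
        if "num_nodes" in config and "num_gpus" in config:
            return int(config["num_nodes"]) * int(config["num_gpus"])
        # Case 2: world_size is used directly.
        if "world_size" in config:
            return int(config["world_size"])
        # Case 3: fall back to the value reported in runInfo.
        return int(run_info["gpuCount"])
    except (KeyError, TypeError, ValueError):
        # Unexpected format: report 0 and warn, as described in the Notes.
        logger.warning("Unexpected GPU count format: config=%r run_info=%r", config, run_info)
        return 0


print(resolve_gpu_count({"num_nodes": 2, "num_gpus": 8}, {}))  # 16
print(resolve_gpu_count({"world_size": 16}, {}))               # 16
print(resolve_gpu_count({}, {"gpuCount": 8}))                  # 8
print(resolve_gpu_count({}, {}))                               # 0, with a warning
```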
