This repository contains tools for tracking GPU usage and generating dashboards.
- Collect GPU usage data from multiple companies and projects
- Generate daily, weekly, monthly, and all-time GPU usage reports
- Update dashboards using Weights & Biases (wandb)
- Detect and alert on abnormal GPU usage rates
.
├── Dockerfile.check_dashboard
├── Dockerfile.main
├── README.md
├── config.yaml
├── main.py
├── requirements.txt
├── src
│   ├── alart
│   │   └── check_dashboard.py
│   ├── calculator
│   │   ├── blank_table.py
│   │   ├── gpu_usage_calculator.py
│   │   └── remove_tags.py
│   ├── tracker
│   │   ├── common.py
│   │   ├── config_parser.py
│   │   ├── run_manager.py
│   │   └── set_gpucount.py
│   ├── uploader
│   │   ├── artifact_handler.py
│   │   ├── data_processor.py
│   │   └── run_uploader.py
│   └── utils
│       └── config.py
└── image
    └── gpu-dashboard.drawio.png
In the gpu-dashboard directory, run the following commands:
$ python3 -m venv .venv
$ . .venv/bin/activate
$ pip install -r requirements.txt
Request an AWS account from the administrator and assign access permissions to the following services in IAM:
- AWSBatch
- CloudWatch
- EC2
- ECS
- ECR
- EventBridge
- IAM
- VPC
Create a user for AWS CLI in IAM. Assign access permissions to the following service:
- ECR
Click on the created user and note the following values from the Access Keys tab:
- Access key ID
- Secret access key
Run the following command in your local Terminal to log in to AWS:
$ aws configure
AWS Access Key ID [None]: {Access key ID}  # press Enter
AWS Secret Access Key [None]: {Secret access key}  # press Enter
Default region name [None]:  # leave blank, press Enter
Default output format [None]:  # leave blank, press Enter
After configuration, check the connection with the following command. If successful, it will output a list of your S3 buckets:
$ aws s3 ls
Reference: AWS CLI Setup Tutorial
- Navigate to Amazon ECR > Private registry > Repositories
- Click Create repository
- Enter a repository name (e.g., geniac-gpu)
- Click Create repository
- Click on the created repository name
- Click View push commands
- Execute the four displayed commands in order in your local Terminal
# Example commands
$ aws ecr get-login-password --region ap-northeast-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com
$ docker build -t geniac-gpu .
$ docker tag geniac-gpu:latest 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/geniac-gpu:latest
$ docker push 111122223333.dkr.ecr.ap-northeast-1.amazonaws.com/geniac-gpu:latest
Because these commands are specific to each repository, saving them in a shell script makes deployments easier from the second time onward.
Create repositories for both gpu-dashboard and check-dashboard by following the steps above.
- Navigate to Virtual Private Cloud > Your VPCs
- Click Create VPC
- Select VPC and more for Resources to create
- Click Create VPC
- Navigate to IAM > Roles
- Click Create role
- Set up the Use case:
  - Select Elastic Container Service for Service
  - Select Elastic Container Service Task for Use case
- Select AmazonEC2ContainerRegistryReadOnly and CloudWatchLogsFullAccess for Permission policies
- Click Next
- Enter ecsTaskExecutionRole for Role name
- Click Create role
- Navigate to Amazon Elastic Container Service > Clusters
- Click Create Cluster
- Enter a cluster name
- Click Create
- Navigate to Amazon Elastic Container Service > Task Definitions
- Click Create new Task Definition, then click Create new Task Definition
- Enter a task definition family name
- Change CPU and Memory in Task size as needed
- Select ecsTaskExecutionRole for Task role
- Set up Container - 1:
  - Enter the repository name and image URI pushed to ECR in Container details
  - Set Resource allocation limits appropriately according to Task size
- Click Add environment variable in Environment variables - optional and add the following:
  - Key: WANDB_API_KEY
  - Value: {Your WANDB_API_KEY}
- Click Create
- Navigate to Amazon Elastic Container Service > Clusters > {Cluster Name} > Scheduled Tasks
- Click Create
- Enter a rule name for Scheduled rule name
- Select cron expression for Scheduled rule type
- Enter an appropriate expression in cron expression
  - Note that this UI expects UTC time, so cron(15 15 * * ? *) corresponds to 0:15 AM Japan time
- Enter a target ID for Target ID
- Select the task definition from Task Definition family
- Select VPC and subnets in Networking
- If there is no existing security group in Security group, select Create a new security group and create one
- Click Create
Execute the following commands to set up a local Python environment for running the scheduled script.
You can edit config.yaml
to minimize impact on the production environment.
$ cd gpu-dashboard
$ python3 -m venv .venv
$ . .venv/bin/activate
python main.py [--api WANDB_API_KEY] [--start-date YYYY-MM-DD] [--end-date YYYY-MM-DD]
- --api: wandb API key (optional, can be set as an environment variable)
- --start-date: Data retrieval start date (optional)
- --end-date: Data retrieval end date (optional)
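For example, to backfill a specific date range (the dates below are only illustrative):

$ python main.py --start-date 2024-04-01 --end-date 2024-04-07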
python src/alart/check_dashboard.py
- src/tracker/: GPU usage data collection
- src/calculator/: GPU usage statistics calculation
- src/uploader/: Data upload to wandb
- src/alart/: Anomaly detection and alert functionality
- In AWS, navigate to CloudWatch > Log groups
- Click on /ecs/{task definition name}
- Click on the log stream to view logs
- Fetch latest data (src/tracker/)
  - Set start_date and end_date
    - If unspecified, both values default to yesterday's date
  - Create a list of companies
  - Fetch projects for each company [Public API]
  - Fetch runs for each project [Private API]
    - Filter by target_date, tags
    - Detect and alert runs that initialize wandb multiple times on the same instance
  - Fetch system metrics for each run [Public API]
  - Aggregate by run id x date
- Update data (src/uploader/)
  - Retrieve csv up to yesterday from Artifacts
  - Concatenate with the latest data and save to Artifacts
  - Filter run ids
- Aggregate and update data (src/calculator)
  - Remove latest tag
  - Aggregate retrieved data
    - Aggregate overall data
    - Aggregate monthly data
    - Aggregate weekly data
    - Aggregate daily data
    - Aggregate summary data
  - Update overall table
  - Update tables for each company
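The Python sketch below illustrates how these three stages fit together. It is not the repository's actual code: fetch_usage_rows, load_history, save_history, and update_tables are hypothetical stubs standing in for the tracker, uploader, and calculator modules, and the company names are placeholders.

```python
"""Illustrative outline of the daily update flow above (sketch only)."""
from __future__ import annotations

from datetime import date, timedelta


def fetch_usage_rows(company: str, start: date, end: date) -> list[dict]:
    """Stand-in for src/tracker: fetch runs and system metrics, aggregated by run id x date."""
    return []


def load_history() -> list[dict]:
    """Stand-in for src/uploader: retrieve the CSV stored up to yesterday from Artifacts."""
    return []


def save_history(rows: list[dict]) -> None:
    """Stand-in for src/uploader: concatenate with the latest data and save back to Artifacts."""


def update_tables(rows: list[dict]) -> None:
    """Stand-in for src/calculator: rebuild the overall/monthly/weekly/daily/summary tables."""


def daily_update(start: date | None = None, end: date | None = None) -> None:
    yesterday = date.today() - timedelta(days=1)
    start = start or yesterday   # both dates default to yesterday when unspecified
    end = end or yesterday

    new_rows: list[dict] = []
    for company in ("company-a", "company-b"):   # placeholder names; the pipeline builds the real list
        new_rows += fetch_usage_rows(company, start, end)

    combined = load_history() + new_rows         # merge history with the latest data
    save_history(combined)                       # keep the Artifacts CSV up to date
    update_tables(combined)                      # refresh the overall and per-company tables in wandb


if __name__ == "__main__":
    daily_update()
```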
In the __set_gpucount method in src/tracker/run_manager.py, the GPU count for distributed processing is calculated differently depending on the team and configuration. The calculation methods and concrete examples are described below.
- Retrieve the values of num_nodes and num_gpus, and multiply them to calculate the GPU count.
- These values are obtained from the config section in the configuration file.
config = { "num_nodes": 2, "num_gpus": 8 }
gpu_count = 2 * 8 = 16
In this example, there are 2 nodes, each with 8 GPUs, resulting in a total GPU count of 16.
- Use the value of world_size to determine the GPU count.
- world_size is obtained from the config section in the configuration file.
config = { "world_size": 16 }
gpu_count = 16
In this example, the value of world_size is directly used as the GPU count.
- Use the value of node.runInfo.gpuCount.
node.runInfo = { "gpuCount": 8 }
gpu_count = 8
In this example, the GPU count is directly obtained from runInfo.
This method aims to calculate the GPU count as accurately as possible by accommodating various configuration formats. However, when encountering unexpected data formats, it sets the GPU count to 0 for safety and outputs a warning.
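A minimal sketch of this fallback chain is shown below, assuming the checks are applied in the order described above. It is not the actual __set_gpucount implementation; estimate_gpu_count and its arguments are illustrative.

```python
from typing import Optional
import warnings


def estimate_gpu_count(config: dict, run_info: Optional[dict] = None) -> int:
    """Illustrative fallback chain for determining a run's GPU count."""
    try:
        # 1) num_nodes x num_gpus from the run's config
        if "num_nodes" in config and "num_gpus" in config:
            return int(config["num_nodes"]) * int(config["num_gpus"])
        # 2) world_size from the run's config
        if "world_size" in config:
            return int(config["world_size"])
        # 3) gpuCount reported in node.runInfo
        if run_info and "gpuCount" in run_info:
            return int(run_info["gpuCount"])
    except (TypeError, ValueError):
        pass  # unexpected format: fall through to the safe default below
    warnings.warn("Could not determine GPU count; defaulting to 0")
    return 0


print(estimate_gpu_count({"num_nodes": 2, "num_gpus": 8}))       # 16
print(estimate_gpu_count({"world_size": 16}))                    # 16
print(estimate_gpu_count({}, {"gpuCount": 8}))                   # 8
print(estimate_gpu_count({"num_nodes": "two", "num_gpus": 8}))   # 0, with a warning
```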