[Tool] add support for setting memory limit, sleep time, GPU IDs, and enabling quiet mode #404


Conversation

@ConvolutedDog ConvolutedDog commented Jan 6, 2025

PR Category

User Experience

Type of Change

Other

Description

This commit adds support for setting a memory limit, sleep time, and GPU IDs, and for enabling quiet mode in this script:

  1. Quiet Mode: Add a -q or --quiet option to suppress output.
  2. Memory Limit: Add a -m or --memory option to set the maximum memory usage limit; the default of 30000 MB is retained.
  3. Sleep Time: Add a -s or --sleep option to set the wait time between retry checks; the default of 120 seconds is retained.
  4. GPU IDs: Add a -g or --gpu option to specify which GPUs to monitor. Users can provide a comma-separated list of GPU IDs or use "all" to monitor all GPUs.
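For illustration, the four options above could be parsed with a standard bash `case` loop. This is only a sketch; the variable names and the `parse_args` helper are hypothetical and may not match what gpu_check.sh actually uses.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the option parsing described above.
MEMORY_LIMIT=30000   # minimum free MB required per GPU (default)
SLEEP_TIME=120       # seconds to wait between retry checks (default)
GPU_IDS="all"        # comma-separated GPU IDs, or "all"
QUIET=false          # quiet mode off by default

parse_args() {
  while [[ $# -gt 0 ]]; do
    case "$1" in
      -m|--memory) MEMORY_LIMIT="$2"; shift 2 ;;
      -s|--sleep)  SLEEP_TIME="$2";   shift 2 ;;
      -g|--gpu)    GPU_IDS="$2";      shift 2 ;;
      -q|--quiet)  QUIET=true;        shift   ;;
      *) echo "unknown option: $1" >&2; return 1 ;;
    esac
  done
}

# Demo invocation mirroring the examples in the help text.
parse_args --memory 20000 --gpu 0,3 --quiet
echo "memory=${MEMORY_LIMIT} sleep=${SLEEP_TIME} gpus=${GPU_IDS} quiet=${QUIET}"
# prints: memory=20000 sleep=120 gpus=0,3 quiet=true
```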

Issue

None

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

$ ./gpu_check.sh -h
Monitor GPU memory usage and wait until sufficient memory is available before proceeding.

This script checks the available memory on specified NVIDIA GPUs. If the available memory
on any specified GPU is below the specified memory usage limit, the script will wait for 
a specified time and retry.

Usage: ./gpu_check.sh [options]

Options:
  -m, --memory <MB>     Set the maximum memory usage limit (default: 30000 MB).
                        This is the minimum amount of free memory required on each GPU.
  -s, --sleep <seconds> Set the wait time between checks (default: 120 seconds).
                        This is the time the script will wait before rechecking GPU memory.
  -g, --gpu <ids>       Set the GPU IDs to monitor (default: all).
                        Use 'all' to monitor all GPUs, or specify a comma-separated list (e.g., '0,1').
  -q, --quiet           Enable quiet mode (default: false).
  -h, --help            Display this help message.

Examples:
  ./gpu_check.sh                           # Run with default values (30000 MB memory limit, 120 seconds sleep)
  ./gpu_check.sh --memory 20000            # Set memory limit to 20000 MB
  ./gpu_check.sh --sleep 60                # Set sleep time to 60 seconds
  ./gpu_check.sh --memory 15000 --sleep 30 # Set memory limit to 15000 MB and sleep time to 30 seconds
  ./gpu_check.sh --memory 15000 --gpu 0,3  # Set memory limit to 15000 MB and monitor GPU 0 and GPU 3
  ./gpu_check.sh --quiet                   # Enable quiet mode

Note: Ensure that nvidia-smi is installed and properly configured to use this script.
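The wait-and-retry behavior the help text describes could be sketched roughly as below. The `--query-gpu=index,memory.free --format=csv,noheader,nounits` flags are a real nvidia-smi invocation; the `busy_gpus` helper and variable names are illustrative assumptions, not code from gpu_check.sh. The check is written as a pure function over CSV text so it can be exercised without a GPU.

```shell
#!/usr/bin/env bash
MEMORY_LIMIT=30000   # minimum free MB required per GPU
SLEEP_TIME=120       # seconds between retries

# Read "index, memory.free" CSV lines on stdin and print the IDs of
# GPUs whose free memory (MB) is below the given limit.
busy_gpus() {
  local limit="$1" idx free
  while IFS=', ' read -r idx free; do
    if [[ -n "$idx" && "$free" -lt "$limit" ]]; then
      echo "$idx"
    fi
  done
}

# In the real script, the retry loop would look roughly like:
#   while true; do
#     busy=$(nvidia-smi --query-gpu=index,memory.free \
#              --format=csv,noheader,nounits | busy_gpus "$MEMORY_LIMIT")
#     [[ -z "$busy" ]] && break   # all monitored GPUs have enough free memory
#     sleep "$SLEEP_TIME"         # otherwise wait and recheck
#   done
```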

@StrongSpoon (Collaborator)

Thanks for your contribution! We are curious why you think FlagGems needs a script to control the GPU, and how FlagGems benefits from it. Besides, since FlagGems targets multiple backends, we need to verify it on heterogeneous accelerators, which might not be compatible with NVIDIA GPU commands.

@ConvolutedDog (Author)

> Thanks for your contribution! We are curious why you think FlagGems needs a script to control the GPU, and how FlagGems benefits from it. Besides, since FlagGems targets multiple backends, we need to verify it on heterogeneous accelerators, which might not be compatible with NVIDIA GPU commands.

Hi, this script was not originally added by me; I only enhanced it. It is used in the CI actions for the op tests to isolate automatic testing from other PRs. If multi-backend support is needed in the future, additional scripts will be necessary.

@StrongSpoon (Collaborator)

@Galaxy1458 please review the code and decide if multi-backend needs this function.

@Galaxy1458 (Collaborator)

@StrongSpoon As @ConvolutedDog says, this PR only enhances the functionality of the original gpu_check.sh. Almost all of the current CI files will need to be refactored later if multi-backend CI is needed.
