[Tool] add support for setting memory limit, sleep time, GPU IDs, and enabling quiet mode #404


Conversation

@ConvolutedDog ConvolutedDog commented Jan 6, 2025

PR Category

User Experience

Type of Change

Other

Description

This commit adds support for setting a memory limit, sleep time, and GPU IDs, and for enabling quiet mode in this script:

  1. Quiet Mode: Add a -q or --quiet option to suppress output.
  2. Memory Limit: Add a -m or --memory option to set the maximum memory usage limit; the default of 30000 MB is retained.
  3. Sleep Time: Add a -s or --sleep option to set the wait time between retry checks; the default of 120 seconds is retained.
  4. GPU IDs: Add a -g or --gpu option to specify which GPUs to monitor. Users can provide a comma-separated list of GPU IDs or use "all" to monitor all GPUs.
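For illustration, the four options above could be parsed with a standard bash `case` loop. This is only a sketch; the variable names and the `parse_args` helper are hypothetical and may not match what gpu_check.sh actually uses.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the option parsing described above.
MEMORY_LIMIT=30000   # minimum free MB required per GPU (default)
SLEEP_TIME=120       # seconds to wait between retry checks (default)
GPU_IDS="all"        # comma-separated GPU IDs, or "all"
QUIET=false          # quiet mode off by default

parse_args() {
  while [[ $# -gt 0 ]]; do
    case "$1" in
      -m|--memory) MEMORY_LIMIT="$2"; shift 2 ;;
      -s|--sleep)  SLEEP_TIME="$2";   shift 2 ;;
      -g|--gpu)    GPU_IDS="$2";      shift 2 ;;
      -q|--quiet)  QUIET=true;        shift   ;;
      *) echo "unknown option: $1" >&2; return 1 ;;
    esac
  done
}

# Demo invocation mirroring the examples in the help text.
parse_args --memory 20000 --gpu 0,3 --quiet
echo "memory=${MEMORY_LIMIT} sleep=${SLEEP_TIME} gpus=${GPU_IDS} quiet=${QUIET}"
# prints: memory=20000 sleep=120 gpus=0,3 quiet=true
```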

Issue

None

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

$ ./gpu_check.sh -h
Monitor GPU memory usage and wait until sufficient memory is available before proceeding.

This script checks the available memory on specified NVIDIA GPUs. If the available memory
on any specified GPU is below the specified memory usage limit, the script will wait for 
a specified time and retry.

Usage: ./gpu_check.sh [options]

Options:
  -m, --memory <MB>     Set the maximum memory usage limit (default: 30000 MB).
                        This is the minimum amount of free memory required on each GPU.
  -s, --sleep <seconds> Set the wait time between checks (default: 120 seconds).
                        This is the time the script will wait before rechecking GPU memory.
  -g, --gpu <ids>       Set the GPU IDs to monitor (default: all).
                        Use 'all' to monitor all GPUs, or specify a comma-separated list (e.g., '0,1').
  -q, --quiet           Enable quiet mode (default: false).
  -h, --help            Display this help message.

Examples:
  ./gpu_check.sh                           # Run with default values (30000 MB memory limit, 120 seconds sleep)
  ./gpu_check.sh --memory 20000            # Set memory limit to 20000 MB
  ./gpu_check.sh --sleep 60                # Set sleep time to 60 seconds
  ./gpu_check.sh --memory 15000 --sleep 30 # Set memory limit to 15000 MB and sleep time to 30 seconds
  ./gpu_check.sh --memory 15000 --gpu 0,3  # Set memory limit to 15000 MB and monitor GPU 0 and GPU 3
  ./gpu_check.sh --quiet                   # Enable quiet mode

Note: Ensure that nvidia-smi is installed and properly configured to use this script.
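The wait-and-retry behavior the help text describes could be sketched roughly as below. The `--query-gpu=index,memory.free --format=csv,noheader,nounits` flags are a real nvidia-smi invocation; the `busy_gpus` helper and variable names are illustrative assumptions, not code from gpu_check.sh. The check is written as a pure function over CSV text so it can be exercised without a GPU.

```shell
#!/usr/bin/env bash
MEMORY_LIMIT=30000   # minimum free MB required per GPU
SLEEP_TIME=120       # seconds between retries

# Read "index, memory.free" CSV lines on stdin and print the IDs of
# GPUs whose free memory (MB) is below the given limit.
busy_gpus() {
  local limit="$1" idx free
  while IFS=', ' read -r idx free; do
    if [[ -n "$idx" && "$free" -lt "$limit" ]]; then
      echo "$idx"
    fi
  done
}

# In the real script, the retry loop would look roughly like:
#   while true; do
#     busy=$(nvidia-smi --query-gpu=index,memory.free \
#              --format=csv,noheader,nounits | busy_gpus "$MEMORY_LIMIT")
#     [[ -z "$busy" ]] && break   # all monitored GPUs have enough free memory
#     sleep "$SLEEP_TIME"         # otherwise wait and recheck
#   done
```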

@StrongSpoon (Collaborator)

Thanks for your contribution! We are curious why you think FlagGems needs a script to control the GPU, and how FlagGems benefits from it. Besides, since FlagGems targets multiple backends, we need to verify it on heterogeneous accelerators, which might not be compatible with NVIDIA GPU commands.

@ConvolutedDog (Author)

> Thanks for your contribution! We are curious why you think FlagGems needs a script to control the GPU, and how FlagGems benefits from it. Besides, since FlagGems targets multiple backends, we need to verify it on heterogeneous accelerators, which might not be compatible with NVIDIA GPU commands.

Hi, this script was not originally added by me; I only enhanced it. It is used in the CI actions for the op tests to isolate automatic testing from other PRs. If multi-backend support is needed in the future, additional scripts will be necessary.

@StrongSpoon (Collaborator)

@Galaxy1458 please review the code and decide if multi-backend needs this function.

@Galaxy1458 (Collaborator)

@StrongSpoon As @ConvolutedDog says, this PR only enhances the functionality of the original gpu_check.sh. Almost all of the current CI files will need to be refactored later if multi-backend CI is needed.
