Bugfix: use device in all Torch models #5026

Merged
merged 10 commits into from
Jan 17, 2025

Conversation

jacobsela
Contributor

@jacobsela jacobsela commented Oct 31, 2024

Resolves #5271

Summary by CodeRabbit

  • New Features

    • Added device configuration options for machine learning models.
    • Enhanced model compatibility with different hardware setups.
  • Improvements

    • Improved device management for GPU and CPU processing.
    • More flexible device selection for transformer and AI models.
  • Technical Updates

    • Updated device handling methods across multiple utility classes.
    • Introduced device attribute in configuration classes for more precise control.

Contributor

coderabbitai bot commented Oct 31, 2024

Walkthrough

The changes involve modifications to the device management in the TorchOpenClipModel, TorchYoloNasModel, and transformer classes within the fiftyone/utils/open_clip.py, fiftyone/utils/super_gradients.py, fiftyone/utils/transformers.py, and fiftyone/utils/ultralytics.py files. The updates replace direct calls to .cuda() with .to(self.device) for moving tensors and models to the appropriate device, enhancing compatibility across different hardware configurations.

Changes

| File | Change Summary |
|---|---|
| fiftyone/utils/open_clip.py | Updated _get_text_features, _embed_prompts, and _predict_all methods to use text.to(self.device) and imgs.to(self.device) for device management. |
| fiftyone/utils/super_gradients.py | Modified _load_model method to use model.to(self.device) for transferring the model to the appropriate device. |
| fiftyone/utils/transformers.py | Introduced device attribute in FiftyOneTransformerConfig and FiftyOneZeroShotTransformerConfig; modified initialization in transformer classes to utilize this attribute for device management. |
| fiftyone/utils/ultralytics.py | Added device attribute to FiftyOneYOLOModelConfig and updated the constructor in FiftyOneYOLOModel to use model.to(self.device). |
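
To make the pattern in this summary concrete, here is a minimal sketch of a wrapper that resolves a device once and then calls `.to(self.device)` instead of `.cuda()`. `DeviceAwareModel` and its `predict` method are hypothetical names used only for illustration, not FiftyOne classes, and the `torch.nn.Linear` layer merely stands in for a real network.

```python
import torch


class DeviceAwareModel:
    """Hypothetical wrapper for illustration; not part of FiftyOne."""

    def __init__(self, device=None):
        # Default: first CUDA device if available, otherwise the CPU
        if device is None:
            device = "cuda:0" if torch.cuda.is_available() else "cpu"

        self.device = torch.device(device)
        # Stand-in network, moved to the configured device rather than .cuda()
        self._model = torch.nn.Linear(4, 2).to(self.device)

    def predict(self, batch):
        # Inputs follow the model's device (e.g. "cuda:2"), not the default GPU
        batch = batch.to(self.device)
        with torch.no_grad():
            return self._model(batch)


model = DeviceAwareModel(device="cpu")
output = model.predict(torch.zeros(3, 4))
```

With `.cuda()`, both the model and the inputs would land on the current default CUDA device regardless of what was requested; `.to(self.device)` keeps them on whichever device was configured.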

Assessment against linked issues

Objective Addressed Explanation
Resolve hardcoded CUDA device issue in apply_model (#5271)

Possibly related PRs

  • Transformers GPU Support #4987: The changes in this PR also focus on device management for models, specifically enhancing GPU support in the fiftyone/utils/transformers.py file, which aligns with the device handling improvements made in the main PR's TorchOpenClipModel class.

Suggested reviewers

  • brimoor

Poem

In the patch of code, a rabbit hops,
With changes made, it never stops.
Through functions and loops, it scurries with glee,
Enhancing the zoo for all to see! 🐇✨

@danielgural
Contributor

danielgural commented Oct 31, 2024

Still works fine, and I can see the difference between cpu and cuda. Note for the future: this change is not pulled upstream by

fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    brain_key="img_sim",
    device="cuda",
)

and just noticed. Something for next time :)

@harpreetsahota204 can you run this code when you test:

import fiftyone.zoo as foz

model = foz.load_zoo_model("clip-vit-base32-torch", device="cuda")
print(model._model.visual.conv1._parameters["weight"][0].device)

This is to make sure the model also works in a multi-GPU setup.
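
A slightly more general version of the check above walks every parameter instead of one hand-picked weight. This is only a sketch: `parameters_on_device` is a hypothetical helper, and it is demonstrated on a small `torch.nn.Sequential` on the CPU rather than on a zoo model.

```python
import torch


def parameters_on_device(module, expected):
    """Return True if every parameter of ``module`` lives on ``expected``."""
    expected = torch.device(expected)
    return all(p.device == expected for p in module.parameters())


# Small stand-in network; a real check would pass the loaded model instead
net = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
net.to("cpu")
on_cpu = parameters_on_device(net, "cpu")
```

Note that `torch.device("cuda")` and `torch.device("cuda:0")` do not compare equal, so pass an explicit index when checking a specific GPU.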

danielgural
danielgural previously approved these changes Oct 31, 2024
Contributor

@danielgural danielgural left a comment

LGTM

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")
session = fo.launch_app(dataset)
model = foz.load_zoo_model("clip-vit-base32-torch", device="cuda")
embeddings = dataset.compute_embeddings(model)

worked as expected

fiftyone/utils/clip/zoo.py (outdated review thread, resolved)
@brimoor brimoor changed the title bugfix Use device in all Torch models Nov 1, 2024
@brimoor brimoor changed the title Use device in all Torch models Bugfix: use device in all Torch models Nov 1, 2024
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
fiftyone/utils/super_gradients.py (1)

98-100: Consider adding a docstring note about device flexibility.

Since this change enables flexible device selection, it would be helpful to document this capability in the class or method docstring. This would help users understand that they can use any available GPU.

Add a note like this to the class docstring:

 """FiftyOne wrapper around YOLO-NAS from
 https://github.com/Deci-AI/super-gradients.
+
+The model automatically uses the appropriate device (CPU/GPU) based on availability
+and can work with any CUDA device, not just the default one.

 Args:
     config: a :class:`TorchYoloNasModelConfig`
 """
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3b3596f and 9a89a70.

📒 Files selected for processing (3)
  • fiftyone/utils/clip/zoo.py (1 hunks)
  • fiftyone/utils/open_clip.py (3 hunks)
  • fiftyone/utils/super_gradients.py (1 hunks)
🔥 Files not summarized due to errors (1)
  • fiftyone/utils/clip/zoo.py: Error: Server error: no LLM provider could handle the message
🔇 Additional comments (4)
fiftyone/utils/super_gradients.py (1)

99-99: LGTM! Device management improvement.

The change from model.cuda() to model.to(self.device) improves GPU device flexibility, allowing the model to work with any available GPU instead of being restricted to the default one.

Let's verify that the model parameters are correctly moved to the specified device:

✅ Verification successful

Device management change is properly implemented

The verification confirms that the change from model.cuda() to model.to(self.device) is correct and consistent with the codebase's device management pattern. Key findings:

  1. The base TorchImageModel class properly initializes self.device in its constructor:
device = self.config.device
if device is None:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
  2. All other model implementations (CLIP, SAM, YOLO-NAS) consistently use self.device for device management.

  3. The change aligns with the existing pattern used across the codebase for moving tensors and models to the target device.
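
The default-device rule quoted in the first item can be isolated into a tiny pure-Python function so it is easy to test without a GPU. `resolve_device` and its `cuda_available` parameter are hypothetical; in real code the flag would come from `torch.cuda.is_available()`.

```python
def resolve_device(requested=None, cuda_available=False):
    """Return the device string to use.

    An explicit request always wins; otherwise fall back to the first CUDA
    device when one is available, else the CPU.
    """
    if requested is not None:
        return requested
    return "cuda:0" if cuda_available else "cpu"


explicit = resolve_device("cuda:2", cuda_available=True)
default_gpu = resolve_device(None, cuda_available=True)
default_cpu = resolve_device(None, cuda_available=False)
```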

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if there are any remaining hard-coded device assignments
# that might conflict with the new flexible device management

# Test 1: Look for any remaining .cuda() calls that might bypass device management
rg -g '*.py' '\.cuda\(' 

# Test 2: Check for proper device initialization in the parent class
ast-grep --pattern 'class TorchImageModel:
  $$$
  def __init__($$$):
    $$$
    self.device = $$$
    $$$
'

Length of output: 137


Script:

#!/bin/bash
# Let's try a different approach to verify device management

# Check for device-related code in TorchImageModel and its subclasses
rg -g '*.py' 'class.*TorchImageModel' -A 10

# Look for device initialization patterns
rg -g '*.py' 'self\.device' -B 2 -A 2

# Check for any remaining cuda-specific code
rg -g '*.py' 'cuda|gpu' -B 2 -A 2

Length of output: 26257

fiftyone/utils/open_clip.py (3)

109-109: LGTM: Device handling improvement in text feature extraction.

The change from .cuda() to .to(self.device) is correct and aligns with the PR's objective of supporting multiple GPUs.


121-121: LGTM: Device handling improvement in prompt embedding.

The change from .cuda() to .to(self.device) is correct and consistent with the previous change.


109-109: Verify device consistency across model components.

The device handling changes look good, but we should verify that all model components (model, tensors, etc.) are consistently using the same device.

Also applies to: 121-121, 146-146

✅ Verification successful

Based on the shell script results, I can now generate the final response as the device handling patterns are clear across the codebase.

Device handling is consistent across model components

The device handling in the OpenCLIP implementation follows the established patterns used throughout the codebase:

  • The model and tensors are consistently moved to the same device using to(self.device)
  • Device initialization follows the standard pattern of defaulting to CUDA if available, otherwise CPU
  • All tensor operations (text encoding, image encoding) maintain device consistency
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify device handling consistency across the codebase
# Look for any remaining .cuda() calls or device-related patterns

# Test 1: Check for any remaining .cuda() calls
echo "Checking for remaining .cuda() calls..."
rg '\.cuda\(' 

# Test 2: Check for device-related patterns in TorchOpenClipModel and related classes
echo "Checking device handling patterns..."
rg -A 3 'to\(.*device|\.device|device\s*=' 

Length of output: 14047

@@ -143,7 +143,7 @@ def _predict_all(self, imgs):
         frame_size = (width, height)
 
         if self._using_gpu:
-            imgs = imgs.cuda()
+            imgs = imgs.to(self.device)

Contributor

@coderabbitai coderabbitai bot Nov 1, 2024

⚠️ Potential issue

Consider adding error handling for device transfers.

While the device handling change is correct, the subsequent autocast context is hardcoded to "cuda". This might cause issues when running on CPU.

Apply this diff to make it device-aware:

         if self._using_gpu:
             imgs = imgs.to(self.device)
 
-        with torch.no_grad(), torch.amp.autocast("cuda"):
+        with torch.no_grad(), torch.amp.autocast(device_type=self.device.type if self._using_gpu else "cpu"):
             image_features = self._model.encode_image(imgs)
             text_features = self._get_text_features()

Also applies to: 147-152
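
One detail worth spelling out: autocast takes a device *type* such as "cuda" or "cpu", while `self.device` may carry an index such as "cuda:2". The sketch below derives the type from a `torch.device`. `autocast_for` is a hypothetical helper, and enabling autocast only on CUDA is an assumption of this sketch rather than something the PR specifies.

```python
import torch


def autocast_for(device):
    """Return an autocast context whose device type matches ``device``."""
    device = torch.device(device)
    # device.type drops the index: "cuda:2" -> "cuda"
    # Assumption: mixed precision is only wanted on CUDA, so it is disabled
    # for every other device type
    return torch.amp.autocast(
        device_type=device.type, enabled=(device.type == "cuda")
    )


layer = torch.nn.Linear(4, 2)
with torch.no_grad(), autocast_for("cpu"):
    features = layer(torch.zeros(1, 4))
```

Because autocast is disabled on the CPU here, the output keeps the layer's default `float32` dtype.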

Contributor

@jacobsela coderabbit raises an interesting point here. Does torch.amp.autocast("cuda") need to be updated?

Contributor

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Contributor Author

Yeah this is valid and will cause problems if not handled. It's in my todo for this week to more thoroughly review the code before moving further with this PR because this message makes me think that there are probably more places I haven't noticed that make hardware assumptions.

@brimoor brimoor changed the base branch from release/v1.0.2 to develop November 7, 2024 23:16
@brimoor
Contributor

brimoor commented Jan 5, 2025

@jacobsela friendly ping here; can we get this completed and merged?

@deltheil

deltheil commented Jan 6, 2025

Any chance fiftyone/utils/transformers.py could be modified as well (HF transformers models are impacted too; see #5271)? Thanks!

@jacobsela
Contributor Author

@deltheil That can be done, we'll look into it.

@jacobsela
Contributor Author

jacobsela commented Jan 9, 2025

@brimoor

Currently testing all zoo models. There seem to be some other unrelated issues that may be worth addressing, e.g. #5359 and an error pasted below with open clip.

I'll push the fixes for transformers once git goes back up. (EDIT: up now)

This also makes me think that we may need proper testing when adding new zoo models. The code isn't very consistent. Not sure if it's worth the time sink though.
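
The per-model testing idea could be sketched with `unittest` sub-tests, one per (model, device) pair, so that a failure in one model does not hide the others. Everything here is a stand-in: `MODEL_NAMES`, `DEVICES`, `DummyModel`, and `load_model` are hypothetical, and no real zoo model is loaded.

```python
import itertools
import unittest

MODEL_NAMES = ["model-a", "model-b"]  # hypothetical names, not zoo models
DEVICES = ["cpu"]  # a real run might add e.g. "cuda:2"


class DummyModel:
    """Stand-in for a zoo model that remembers its device."""

    def __init__(self, name, device):
        self.name = name
        self.device = device

    def predict_all(self, inputs):
        # Pretend each prediction reports the device it ran on
        return [self.device for _ in inputs]


def load_model(name, device):
    """Hypothetical loader; a real test would load the actual model here."""
    return DummyModel(name, device)


class TestDeviceUsage(unittest.TestCase):
    def test_all_models(self):
        for name, device in itertools.product(MODEL_NAMES, DEVICES):
            # Each pair is reported separately if it fails
            with self.subTest(model_name=name, device=device):
                model = load_model(name, device)
                try:
                    outputs = model.predict_all([0, 1])
                except Exception as e:
                    raise Exception(
                        f"Failed to run model {name} on device {device}"
                    ) from e
                self.assertEqual(outputs, [device, device])


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDeviceUsage)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` returns a `unittest.TestResult`; with the dummy models above, every sub-test passes.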

Current status:

Tested devices ['cpu', 'cuda:2']

Tested models - all pass besides open-clip-torch

classification-transformer-torch
clip-vit-base32-torch
mnasnet0.5-imagenet-torch
keypoint-rcnn-resnet50-fpn-coco-torch
depth-estimation-transformer-torch
segment-anything-2.1-hiera-tiny-image-torch
resnext101-32x8d-imagenet-torch
yolov5s-coco-torch
mobilenet-v2-imagenet-torch
wide-resnet50-2-imagenet-torch
vgg16-imagenet-torch
densenet121-imagenet-torch
densenet201-imagenet-torch
dinov2-vits14-torch
fcn-resnet50-coco-torch
retinanet-resnet50-fpn-coco-torch
densenet161-imagenet-torch
vgg13-bn-imagenet-torch
segment-anything-2-hiera-large-image-torch
resnet152-imagenet-torch
wide-resnet101-2-imagenet-torch
dinov2-vitb14-torch
vgg16-bn-imagenet-torch
deeplabv3-resnet50-coco-torch
vgg13-imagenet-torch
detection-transformer-torch
fcn-resnet101-coco-torch
squeezenet-imagenet-torch
resnet50-imagenet-torch
squeezenet-1.1-imagenet-torch
yolov5l-coco-torch
vgg11-bn-imagenet-torch
vgg19-bn-imagenet-torch
resnet34-imagenet-torch
shufflenetv2-1.0x-imagenet-torch
faster-rcnn-resnet50-fpn-coco-torch
resnet18-imagenet-torch
resnext50-32x4d-imagenet-torch
mnasnet1.0-imagenet-torch
alexnet-imagenet-torch
yolov5x-coco-torch
vgg11-imagenet-torch
mask-rcnn-resnet50-fpn-coco-torch
segment-anything-2.1-hiera-large-image-torch
segment-anything-2-hiera-small-image-torch
googlenet-imagenet-torch
densenet169-imagenet-torch
inception-v3-imagenet-torch
segment-anything-2.1-hiera-small-image-torch
segmentation-transformer-torch
dinov2-vitg14-torch
resnet101-imagenet-torch
segment-anything-2-hiera-base-plus-image-torch
shufflenetv2-0.5x-imagenet-torch
segment-anything-2.1-hiera-base-plus-image-torch
yolov5m-coco-torch
deeplabv3-resnet101-coco-torch
dinov2-vitl14-torch
yolov5n-coco-torch
open-clip-torch
vgg19-imagenet-torch
segment-anything-2-hiera-tiny-image-torch

Models not tested (I need to set up an environment to test all of these):

| Model | Why test was skipped |
|---|---|
| med-sam-2-video-torch | Model is not an image model |
| rtdetr-l-coco-torch | Model does not have a device attribute |
| rtdetr-x-coco-torch | Model does not have a device attribute |
| segment-anything-2-hiera-base-plus-video-torch | Model is not an image model |
| segment-anything-2-hiera-large-video-torch | Model is not an image model |
| segment-anything-2-hiera-small-video-torch | Model is not an image model |
| segment-anything-2-hiera-tiny-video-torch | Model is not an image model |
| segment-anything-2.1-hiera-base-plus-video-torch | Model is not an image model |
| segment-anything-2.1-hiera-large-video-torch | Model is not an image model |
| segment-anything-2.1-hiera-small-video-torch | Model is not an image model |
| segment-anything-2.1-hiera-tiny-video-torch | Model is not an image model |
| segment-anything-vitb-torch | Failed to load model |
| segment-anything-vith-torch | Failed to load model |
| segment-anything-vitl-torch | Failed to load model |
| yolo-nas-torch | Failed to load model |
| yolo11l-coco-torch | Model does not have a device attribute |
| yolo11l-seg-coco-torch | Model does not have a device attribute |
| yolo11m-coco-torch | Model does not have a device attribute |
| yolo11m-seg-coco-torch | Model does not have a device attribute |
| yolo11n-coco-torch | Model does not have a device attribute |
| yolo11n-seg-coco-torch | Model does not have a device attribute |
| yolo11s-coco-torch | Model does not have a device attribute |
| yolo11s-seg-coco-torch | Model does not have a device attribute |
| yolo11x-coco-torch | Model does not have a device attribute |
| yolo11x-seg-coco-torch | Model does not have a device attribute |
| yolov10l-coco-torch | Model does not have a device attribute |
| yolov10m-coco-torch | Model does not have a device attribute |
| yolov10n-coco-torch | Model does not have a device attribute |
| yolov10s-coco-torch | Model does not have a device attribute |
| yolov10x-coco-torch | Model does not have a device attribute |
| yolov8l-coco-torch | Model does not have a device attribute |
| yolov8l-obb-dotav1-torch | Model does not have a device attribute |
| yolov8l-oiv7-torch | Model does not have a device attribute |
| yolov8l-seg-coco-torch | Model does not have a device attribute |
| yolov8l-world-torch | Model does not have a device attribute |
| yolov8m-coco-torch | Model does not have a device attribute |
| yolov8m-obb-dotav1-torch | Model does not have a device attribute |
| yolov8m-oiv7-torch | Model does not have a device attribute |
| yolov8m-seg-coco-torch | Model does not have a device attribute |
| yolov8m-world-torch | Model does not have a device attribute |
| yolov8n-coco-torch | Model does not have a device attribute |
| yolov8n-obb-dotav1-torch | Model does not have a device attribute |
| yolov8n-oiv7-torch | Model does not have a device attribute |
| yolov8n-seg-coco-torch | Model does not have a device attribute |
| yolov8s-coco-torch | Model does not have a device attribute |
| yolov8s-obb-dotav1-torch | Model does not have a device attribute |
| yolov8s-oiv7-torch | Model does not have a device attribute |
| yolov8s-seg-coco-torch | Model does not have a device attribute |
| yolov8s-world-torch | Model does not have a device attribute |
| yolov8x-coco-torch | Model does not have a device attribute |
| yolov8x-obb-dotav1-torch | Model does not have a device attribute |
| yolov8x-oiv7-torch | Model does not have a device attribute |
| yolov8x-seg-coco-torch | Model does not have a device attribute |
| yolov8x-world-torch | Model does not have a device attribute |
| yolov9c-coco-torch | Model does not have a device attribute |
| yolov9c-seg-coco-torch | Model does not have a device attribute |
| yolov9e-coco-torch | Model does not have a device attribute |
| yolov9e-seg-coco-torch | Model does not have a device attribute |

Errors

======================================================================
ERROR: test_all_torch_image_models (__main__.TestDeviceUsage) (model_name='open-clip-torch', device='cpu', input_format='numpy')

Traceback (most recent call last):
File "/home/jacobs/test_device_usage.py", line 86, in _test_image_model
fo_torch_model.predict_all(dummy_inputs)
File "/home/jacobs/fiftyone/fiftyone/utils/torch.py", line 691, in predict_all
return self._predict_all(imgs)
File "/home/jacobs/fiftyone/fiftyone/utils/open_clip.py", line 137, in _predict_all
imgs = [self._preprocess(img).unsqueeze(0) for img in imgs]
File "/home/jacobs/fiftyone/fiftyone/utils/open_clip.py", line 137, in <listcomp>
imgs = [self._preprocess(img).unsqueeze(0) for img in imgs]
TypeError: 'bool' object is not callable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jacobs/test_device_usage.py", line 147, in test_all_torch_image_models
self._test_image_model(fo_torch_model, model_name, device, input_format=input_format)
File "/home/jacobs/test_device_usage.py", line 90, in _test_image_model
raise Exception(f"Failed to run model {model_name} on device {device}") from e
Exception: Failed to run model open-clip-torch on device cpu

======================================================================
ERROR: test_all_torch_image_models (__main__.TestDeviceUsage) (model_name='open-clip-torch', device='cuda:2', input_format='numpy')

Traceback (most recent call last):
File "/home/jacobs/test_device_usage.py", line 86, in _test_image_model
fo_torch_model.predict_all(dummy_inputs)
File "/home/jacobs/fiftyone/fiftyone/utils/torch.py", line 691, in predict_all
return self._predict_all(imgs)
File "/home/jacobs/fiftyone/fiftyone/utils/open_clip.py", line 137, in _predict_all
imgs = [self._preprocess(img).unsqueeze(0) for img in imgs]
File "/home/jacobs/fiftyone/fiftyone/utils/open_clip.py", line 137, in <listcomp>
imgs = [self._preprocess(img).unsqueeze(0) for img in imgs]
TypeError: 'bool' object is not callable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jacobs/test_device_usage.py", line 147, in test_all_torch_image_models
self._test_image_model(fo_torch_model, model_name, device, input_format=input_format)
File "/home/jacobs/test_device_usage.py", line 90, in _test_image_model
raise Exception(f"Failed to run model {model_name} on device {device}") from e
Exception: Failed to run model open-clip-torch on device cuda:2


Ran 1 test in 275.831s

FAILED (errors=2)
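
The `TypeError` above is what Python raises when an attribute that is expected to hold a callable actually holds a boolean. Why `self._preprocess` is a `bool` here is not established in this thread; the sketch below only reproduces the symptom, under the assumption that a flag and a transform ended up sharing one attribute name. All class names are hypothetical.

```python
class Base:
    def __init__(self):
        # A boolean flag: "should preprocessing be applied?"
        self._preprocess = True


class Wrapper(Base):
    def __init__(self):
        # The subclass stores a transform under the same attribute name...
        self._preprocess = lambda img: img * 2
        # ...and the base initializer then overwrites it with the flag
        super().__init__()

    def predict_all(self, imgs):
        # Fails: self._preprocess is now True, not a function
        return [self._preprocess(img) for img in imgs]


try:
    Wrapper().predict_all([1, 2])
    message = None
except TypeError as e:
    message = str(e)
```

If a collision of this kind were the cause, giving the flag and the transform distinct attribute names would be one way to avoid it.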

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
fiftyone/utils/transformers.py (2)

326-337: Add device parameter validation and documentation.

The device handling logic is correct, but consider these improvements:

  1. Add validation for the device parameter to ensure only valid values are accepted (e.g., 'cuda', 'cpu', 'cuda:0', etc.)
  2. Document the device parameter in the class docstring.
 """Configuration for a :class:`FiftyOneTransformer`.
 
 Args:
     model (None): a ``transformers`` model
     name_or_path (None): the name or path to a checkpoint file to load
+    device (None): the device to use for model execution (e.g., 'cuda', 'cpu', 'cuda:0').
+        If not specified, uses CUDA if available, otherwise CPU.
 """

759-760: Consider refactoring device initialization to reduce code duplication.

The device initialization pattern is repeated across multiple transformer classes. Consider moving this common functionality to a base class or mixin to promote DRY principles.

Example approach:

class DeviceMixin:
    def _initialize_device(self):
        self.device = torch.device(self.config.device)
        self.model.to(self.device)
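
For the device validation suggested in the first nitpick (326-337), a minimal sketch could look like the following. `validate_device` is a hypothetical helper, and the accepted grammar ("cpu", "mps", "cuda", or "cuda:<index>") is a deliberately narrow assumption; PyTorch itself accepts more device types than this.

```python
import re

# Assumed grammar for this sketch: "cpu", "mps", "cuda", or "cuda:<index>"
_DEVICE_PATTERN = re.compile(r"(cpu|mps|cuda(:\d+)?)")


def validate_device(device):
    """Return ``device`` unchanged if it looks valid, else raise ValueError."""
    if device is None:
        # None means "pick a default later"
        return None
    if not isinstance(device, str) or not _DEVICE_PATTERN.fullmatch(device):
        raise ValueError(f"Unsupported device string: {device!r}")
    return device


checked = validate_device("cuda:0")
```

Validating in the config class would surface a typo such as "gpu0" at construction time instead of at the first `.to()` call.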
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 310a6bd and f57fa2a.

📒 Files selected for processing (1)
  • fiftyone/utils/transformers.py (4 hunks)
🔇 Additional comments (3)
fiftyone/utils/transformers.py (3)

463-464: LGTM! Device handling follows PyTorch best practices.

The implementation correctly initializes the device from config and moves the model to the appropriate device.


509-510: LGTM! Device handling is consistent.

The implementation maintains consistency with the base class's device handling approach.


Line range hint 326-760: Verify tensor operations use the specified device consistently.

Let's ensure all tensor operations in the codebase consistently use the specified device.

✅ Verification successful

Device handling is properly implemented

All tensor operations in the codebase consistently use the specified device through proper device transfer mechanisms. The implementation in the reviewed code aligns with the rest of the codebase's device handling patterns.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential tensor operations that might not use the specified device

# Search for tensor operations that might need device handling
rg -A 2 "torch\.(tensor|Tensor|from_numpy|zeros|ones|arange|linspace|empty|full|eye|rand|randn)" fiftyone/utils/

# Search for potential cuda-specific calls that should use device instead
rg "\.cuda\(\)" fiftyone/utils/

Length of output: 9099

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
fiftyone/utils/transformers.py (1)

464-465: Consider extracting device management into a mixin class.

The device initialization and model movement logic is duplicated across multiple transformer classes. Consider creating a DeviceManagementMixin class to reduce code duplication and maintain consistency.

Example implementation:

class DeviceManagementMixin:
    def initialize_device(self):
        self.device = torch.device(self.config.device)
        self.model.to(self.device)

Then update the transformer classes to use the mixin:

class FiftyOneTransformer(DeviceManagementMixin, TransformerEmbeddingsMixin, Model):
    def __init__(self, config):
        self.config = config
        self.model = self._load_model(config)
        self.initialize_device()
        self.image_processor = self._load_image_processor()

Also applies to: 510-511, 760-761

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f57fa2a and 0adf306.

📒 Files selected for processing (2)
  • fiftyone/utils/transformers.py (4 hunks)
  • fiftyone/utils/ultralytics.py (3 hunks)
🔇 Additional comments (7)
fiftyone/utils/ultralytics.py (3)

23-23: LGTM!

The torch import is correctly added using lazy_import to avoid unnecessary imports.


382-384: LGTM!

The device configuration is correctly implemented with a sensible default that falls back to CPU if CUDA is not available.


397-398: LGTM!

The device initialization and model movement to the specified device are correctly implemented.

fiftyone/utils/transformers.py (4)

326-338: LGTM!

The device configuration logic is well-implemented with proper handling of both string-based model paths and pre-loaded models.


464-465: LGTM!

The device initialization and model movement to the specified device are correctly implemented in FiftyOneTransformer.


510-511: LGTM!

The device initialization and model movement to the specified device are correctly implemented in FiftyOneZeroShotTransformer.


760-761: LGTM!

The device initialization and model movement to the specified device are correctly implemented in FiftyOneZeroShotTransformerForObjectDetection.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
fiftyone/utils/transformers.py (1)

457-458: Consider adding error handling for device availability.

While the device initialization looks correct, it would be beneficial to add error handling for cases where the specified device is not available.

 def __init__(self, config):
     self.config = config
     self.model = self._load_model(config)
-    self.device = torch.device(self.config.device)
-    self.model.to(self.device)
+    try:
+        self.device = torch.device(self.config.device)
+        self.model.to(self.device)
+    except RuntimeError as e:
+        logger.warning(f"Failed to move model to {self.config.device}. Falling back to CPU. Error: {e}")
+        self.device = torch.device("cpu")
+        self.model.to(self.device)
     self.image_processor = self._load_image_processor()
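
An alternative to catching `RuntimeError` is to check availability before moving the model. The following is a sketch only: `pick_available_device` is a hypothetical helper, it considers CUDA devices alone, and silently falling back to the CPU is a design choice that can hide a misconfiguration, so a warning may be preferable in practice.

```python
import torch


def pick_available_device(requested):
    """Return ``requested`` if it looks usable, else fall back to the CPU."""
    device = torch.device(requested)
    if device.type == "cuda":
        # A bare "cuda" has no index; treat it as device 0 for the check
        index = 0 if device.index is None else device.index
        if not torch.cuda.is_available() or index >= torch.cuda.device_count():
            return torch.device("cpu")
    return device


cpu_device = pick_available_device("cpu")
# Index 99 is assumed not to exist on the machine running this sketch
fallback = pick_available_device("cuda:99")
```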
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0adf306 and 06ead81.

📒 Files selected for processing (1)
  • fiftyone/utils/transformers.py (4 hunks)
🔇 Additional comments (2)
fiftyone/utils/transformers.py (2)

326-328: LGTM: Device configuration with sensible defaults.

The device configuration is well-implemented with a sensible default that automatically selects CUDA if available, falling back to CPU otherwise.


326-328: Verify device compatibility across the codebase.

The changes introduce device management across multiple classes. Let's verify that all model operations consistently use the specified device.

Also applies to: 457-458, 503-504, 753-754

✅ Verification successful

Device compatibility verification successful

All model operations consistently use the specified device across the codebase. Input tensors and models are properly moved to the configured device before processing, maintaining compatibility throughout the model operations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential device-related issues in model operations

# Look for tensor operations that might not respect the device setting
rg -A 2 "\.to\(" --type py

# Look for direct cuda() calls that should be replaced with to(self.device)
rg "\.cuda\(" --type py

# Look for device-related patterns in model operations
ast-grep --pattern 'with torch.no_grad():
  $$$
  outputs = $model($$$)
  $$$'

Length of output: 5936

fiftyone/utils/transformers.py (review thread, resolved)
@jacobsela
Contributor Author

Still no testing done for MPS

Models tested w/ various CUDA devices

alexnet-imagenet-torch
classification-transformer-torch
clip-vit-base32-torch
deeplabv3-resnet101-coco-torch
deeplabv3-resnet50-coco-torch
densenet121-imagenet-torch
densenet161-imagenet-torch
densenet169-imagenet-torch
densenet201-imagenet-torch
depth-estimation-transformer-torch
detection-transformer-torch
dinov2-vitb14-torch
dinov2-vitg14-torch
dinov2-vitl14-torch
dinov2-vits14-torch
faster-rcnn-resnet50-fpn-coco-torch
fcn-resnet101-coco-torch
fcn-resnet50-coco-torch
googlenet-imagenet-torch
inception-v3-imagenet-torch
keypoint-rcnn-resnet50-fpn-coco-torch
mask-rcnn-resnet50-fpn-coco-torch
mnasnet0.5-imagenet-torch
mnasnet1.0-imagenet-torch
mobilenet-v2-imagenet-torch
open-clip-torch
resnet101-imagenet-torch
resnet152-imagenet-torch
resnet18-imagenet-torch
resnet34-imagenet-torch
resnet50-imagenet-torch
resnext101-32x8d-imagenet-torch
resnext50-32x4d-imagenet-torch
retinanet-resnet50-fpn-coco-torch
rtdetr-l-coco-torch
rtdetr-x-coco-torch
segment-anything-2-hiera-base-plus-image-torch
segment-anything-2-hiera-large-image-torch
segment-anything-2-hiera-small-image-torch
segment-anything-2-hiera-tiny-image-torch
segment-anything-2.1-hiera-base-plus-image-torch
segment-anything-2.1-hiera-large-image-torch
segment-anything-2.1-hiera-small-image-torch
segment-anything-2.1-hiera-tiny-image-torch
segmentation-transformer-torch
shufflenetv2-0.5x-imagenet-torch
shufflenetv2-1.0x-imagenet-torch
squeezenet-1.1-imagenet-torch
squeezenet-imagenet-torch
vgg11-bn-imagenet-torch
vgg11-imagenet-torch
vgg13-bn-imagenet-torch
vgg13-imagenet-torch
vgg16-bn-imagenet-torch
vgg16-imagenet-torch
vgg19-bn-imagenet-torch
vgg19-imagenet-torch
wide-resnet101-2-imagenet-torch
wide-resnet50-2-imagenet-torch
yolo11l-coco-torch
yolo11l-seg-coco-torch
yolo11m-coco-torch
yolo11m-seg-coco-torch
yolo11n-coco-torch
yolo11n-seg-coco-torch
yolo11s-coco-torch
yolo11s-seg-coco-torch
yolo11x-coco-torch
yolo11x-seg-coco-torch
yolov10l-coco-torch
yolov10m-coco-torch
yolov10n-coco-torch
yolov10s-coco-torch
yolov10x-coco-torch
yolov5l-coco-torch
yolov5m-coco-torch
yolov5n-coco-torch
yolov5s-coco-torch
yolov5x-coco-torch
yolov8l-coco-torch
yolov8l-obb-dotav1-torch
yolov8l-oiv7-torch
yolov8l-seg-coco-torch
yolov8l-world-torch
yolov8m-coco-torch
yolov8m-obb-dotav1-torch
yolov8m-oiv7-torch
yolov8m-seg-coco-torch
yolov8m-world-torch
yolov8n-coco-torch
yolov8n-obb-dotav1-torch
yolov8n-oiv7-torch
yolov8n-seg-coco-torch
yolov8s-coco-torch
yolov8s-obb-dotav1-torch
yolov8s-oiv7-torch
yolov8s-seg-coco-torch
yolov8s-world-torch
yolov8x-coco-torch
yolov8x-obb-dotav1-torch
yolov8x-oiv7-torch
yolov8x-seg-coco-torch
yolov8x-world-torch
yolov9c-coco-torch
yolov9c-seg-coco-torch
yolov9e-coco-torch
yolov9e-seg-coco-torch
zero-shot-classification-transformer-torch
zero-shot-detection-transformer-torch

Models that are still problematic

yolov5 - loads on cuda:0 before being loaded to the device in the argument. Not sure why
hugging face zero shot transformers - expect input_ids argument that isn't passed. probably unrelated error.
open-clip-torch - self._preprocess is a bool instead of a callable. not sure why. probably unrelated.

Models that haven't been tested - need to set up env

med-sam-2-video-torch - Model is not an image model
segment-anything-2-hiera-base-plus-video-torch - Model is not an image model
segment-anything-2-hiera-large-video-torch - Model is not an image model
segment-anything-2-hiera-small-video-torch - Model is not an image model
segment-anything-2-hiera-tiny-video-torch - Model is not an image model
segment-anything-2.1-hiera-base-plus-video-torch - Model is not an image model
segment-anything-2.1-hiera-large-video-torch - Model is not an image model
segment-anything-2.1-hiera-small-video-torch - Model is not an image model
segment-anything-2.1-hiera-tiny-video-torch - Model is not an image model
segment-anything-vitb-torch - Failed to load model
segment-anything-vith-torch - Failed to load model
segment-anything-vitl-torch - Failed to load model
yolo-nas-torch - Failed to load model

@brimoor
Contributor

brimoor commented Jan 9, 2025

Adding @manushreegangwar and @mwoodson1 as ML team reviewers 😄

@brimoor
Contributor

brimoor commented Jan 9, 2025

@jacobsela can you rebase on latest develop? Looks like there is a merge conflict that would currently prevent merging this.

Also:

yolov5 - loads on cuda:0 before being loaded to the device in the argument. Not sure why

We're using Ultralytics' model here. Can anything be done to address this?

hugging face zero shot transformers - expect input_ids argument that isn't passed. probably unrelated error.

On develop on macOS with CPU, these work for me. Are you seeing something different?

open-clip-torch - self._preprocess is a bool instead of a callable. not sure why. probably unrelated.

On develop on macOS with CPU, this works for me. Are you seeing something different?

@danielgural
Contributor

I have some scripts sitting around that can test. I will do Mac CPU + MPS (I have M4) and multi GPU. Will kick off runs tonight and hopefully will finish before morning. Will bring back findings

@jacobsela
Contributor Author

jacobsela commented Jan 9, 2025

Resolved the yolov5 issue. When loading from torch hub, the model is automatically loaded onto the current default device (which is "cuda" in a CUDA-enabled environment). Wrapping line 788 in fiftyone.utils.torch in a with torch.device("cpu") block fixes this while preserving the default device in other parts of the code.

edit: Can't reproduce...
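
For reference, the context-manager idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it requires PyTorch 2.0 or later (where `torch.device` works as a context manager), and `load_on_cpu_then_move` and the `nn.Linear` stand-in are hypothetical names, not the loader at the line referenced above. Code that passes an explicit device to its own calls would not be affected by the context manager.

```python
import torch


def load_on_cpu_then_move(build_model, target_device):
    """Build a model with the CPU as default device, then move it."""
    # Inside this block, factory calls such as torch.zeros or nn.Linear
    # allocate on the CPU unless they are given an explicit device.
    with torch.device("cpu"):
        model = build_model()
    # The default device outside the block is left unchanged.
    return model.to(target_device)


# Hypothetical stand-in for the real loader; a tiny layer keeps this runnable.
target = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = load_on_cpu_then_move(lambda: torch.nn.Linear(4, 2), target)
print(next(model.parameters()).device)
```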

@jacobsela
Contributor Author

I'm just going to always pass "cpu" as the device in the manifest. The model is moved to the correct device afterwards.

@jacobsela jacobsela force-pushed the bugfix/zoo-clip-support-for-multi-gpu-setups branch from 06ead81 to fb7b179 Compare January 10, 2025 00:27
@jacobsela
Contributor Author

jacobsela commented Jan 10, 2025

status:

  1. Rebased locally and force pushed, git merge --no-commit --no-ff bugfix/zoo-clip-support-for-multi-gpu-setups from an up-to-date develop gives no errors. I'm not sure if this was the proper way of doing this.
  2. Fixed the yolov5 issue by updating manifests to use CPU by default. In general it seems that there is no way to pass arguments directly into the entrypoint_fcn of a zoo model from the code, just the manifest. Am I missing something?
  3. For the zero-shot transformers I'm getting the same error ValueError: You have to specify input_ids on cpu in develop. Something may be broken on my env.
  4. SAM2 models have some internal component that defaults to loading on cuda:0. I don't know if it's just my environment or a general thing. I don't know if this is worth the trouble of debugging. Assuming the user has a functioning default "cuda" device with enough free memory they probably wouldn't even notice.

TL;DR
Models that are still up in the air:

  • video models
  • yolo-nas (getting error when downloading weights)
  • open-clip-torch (can't get it to run)
  • zero shot transformers (can't get them to run)

Works but for whatever reason loads on "cuda" before going to desired device:

  • sam 2 models

@danielgural
Contributor

MPS works on all but some transformers due to an aten::upsample_bicubic2d.out operator. Error spits out correctly as "not supported on MPS yet" from torch.

Multi GPU works except for zero-shot-classification-transformer-torch on device cuda

Traceback (most recent call last):
  File "/home/dan/model_testing/jacob_test.py", line 158, in test_all_torch_image_models
    self._test_image_model(fo_torch_model, model_name, device, input_format=input_format)
  File "/home/dan/model_testing/jacob_test.py", line 90, in _test_image_model
    raise Exception(f"Failed to run model {model_name} on device {device}") from e
Exception: Failed to run model zero-shot-classification-transformer-torch on device cuda

+1 to clip input_ids issue.

LGTM for my tests; it just needs the fixes stated above.

@jacobsela
Contributor Author

jacobsela commented Jan 16, 2025

open-clip update

The preprocessor of open clip is loaded in line 92 of fiftyone.utils.open_clip in the method _load_model of the class TorchOpenClipModel.

This method is called by fiftyone.utils.torch.TorchImageModel in the __init__ in line 538. The attribute holding the preprocessor is then set to True a few lines later, in line 544.

Even when this issue is fixed by saving the preprocessor in an auxiliary variable and setting self.preprocess at the end of TorchOpenClipModel's __init__, self.predict doesn't work with any of the possible input types in the contract defined by TorchImageModel (np array, PIL image, torch tensor). The actual open clip model expects PIL image input, ruling out numpy and torch inputs.

Even when self.predict is given PIL image input, the unsqueeze(0) in line 138 of fiftyone.utils.open_clip followed by the stack in line 141 creates an extra dimension that causes an error.
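
The shape problem described above can be reproduced in isolation. The snippet below is a minimal sketch with made-up zero tensors standing in for preprocessed images (the 3x224x224 size and the batch of three are assumptions for illustration); it is not the code from fiftyone.utils.open_clip.

```python
import torch

# Three hypothetical preprocessed images, each shaped (channels, height, width)
imgs = [torch.zeros(3, 224, 224) for _ in range(3)]

# unsqueeze(0) on each image followed by stack adds a second leading axis,
# producing a 5-D tensor shaped (N, 1, C, H, W)
batched_wrong = torch.stack([img.unsqueeze(0) for img in imgs])
print(batched_wrong.shape)  # torch.Size([3, 1, 3, 224, 224])

# Stacking the 3-D images directly gives the usual (N, C, H, W) batch
batched = torch.stack(imgs)
print(batched.shape)  # torch.Size([3, 3, 224, 224])
```

Either dropping the unsqueeze or concatenating the unsqueezed tensors with torch.cat along dimension 0 would yield the 4-D batch that image encoders typically expect.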

Fixing all of these issues fixes the problem in my env.

PR: #5395

@jacobsela
Contributor Author

@danielgural @mwoodson1 @manushreegangwar Given the fact that most models are working, I suggest we currently merge this PR as it is and open separate tickets for the other issues, e.g. #5395

Let me know what you think.

@jacobsela
Contributor Author

Leaving this test script here in case we need it again:
https://github.com/voxel51/test_zoo_models_script

@brimoor
Contributor

brimoor commented Jan 16, 2025

@jacobsela can you retarget this PR at release/v1.3.0 so we can include these fixes in the next release?

@jacobsela jacobsela changed the base branch from develop to release/v1.3.0 January 16, 2025 16:09
Contributor

@manushreegangwar manushreegangwar left a comment

LGTM

@jacobsela jacobsela merged commit 070ccae into release/v1.3.0 Jan 17, 2025
14 checks passed
@jacobsela jacobsela deleted the bugfix/zoo-clip-support-for-multi-gpu-setups branch January 17, 2025 20:32
Development

Successfully merging this pull request may close these issues.

[BUG] apply_model fails on multi-GPU due to hardcoded CUDA device
6 participants