
DINOv2 model slow CPU evaluation #2682

Open
liamwhite opened this issue Dec 27, 2024 · 1 comment
liamwhite commented Dec 27, 2024

Candle is about 10x slower than PyTorch at evaluating this model on the CPU. I have provided a demonstration repository with all the code needed to reproduce the issue.

Output of a typical run of python main.py:

Took 0.12951040267944336 seconds to evaluate

Output of a typical run of target/release/candle_issue_demo:

Took 1.016947847 seconds to evaluate Tensor[dims 1, 1536; f32]
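For context, the timing on the Rust side presumably amounts to wrapping the forward pass with std::time::Instant; a minimal sketch of such a harness (the input shape and the forward closure are placeholders, not the demo repository's actual model code):

```rust
use std::time::Instant;

use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::Cpu;

    // Placeholder 518x518 RGB input; the demo repository instead builds this with
    // the imagenet preprocessing code pasted from the candle examples crate.
    let image = Tensor::zeros((1, 3, 518, 518), DType::F32, &device)?;

    // Stand-in for the DINOv2 forward pass; the real demo runs the model built
    // from the facebook safetensors weights here.
    let forward = |x: &Tensor| -> Result<Tensor> { x.sum_all() };

    let start = Instant::now();
    let features = forward(&image)?;
    println!(
        "Took {} seconds to evaluate {features:?}",
        start.elapsed().as_secs_f64()
    );
    Ok(())
}
```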

This is unfortunate because loading the model from Rust is much faster than loading it from Python, and it would be nice to avoid the need for a server process when running feature extraction on demand.

I tried to keep the gist of the code the same between these, but the Rust version contains two necessary alterations:

  1. The imagenet code from the examples crate is pasted into a module (it probably should be available within the candle_transformers crate, but this is an incredibly minor issue)
  2. The dinov2 code is not designed for the facebook safetensors model, which uses different parameter names; the most significant difference is that qkv is split into separate query, key, and value tensors. This was addressed by pasting the dinov2 module from DinoV2 & Depth Anything V2: Bigger Models #2288 (c9ed473); a rough sketch of one way to handle the split weights follows this list.
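The pasted module handles this internally, but to illustrate the mismatch: when a checkpoint stores query/key/value as separate tensors, one way to recover the fused qkv projection that candle's original dinov2 code expects is to load the three weights and concatenate them. This is a sketch under assumed parameter names ("attn.q.weight" etc.), not the actual code from #2288:

```rust
use candle_core::{Result, Tensor};
use candle_nn::{Linear, VarBuilder};

/// Builds a fused qkv projection from a checkpoint that stores the attention
/// weights as separate query/key/value tensors. The parameter names below are
/// illustrative, not the actual names in the facebook safetensors file.
fn fused_qkv(vb: VarBuilder, dim: usize) -> Result<Linear> {
    let q_w = vb.get((dim, dim), "attn.q.weight")?;
    let k_w = vb.get((dim, dim), "attn.k.weight")?;
    let v_w = vb.get((dim, dim), "attn.v.weight")?;
    let q_b = vb.get(dim, "attn.q.bias")?;
    let k_b = vb.get(dim, "attn.k.bias")?;
    let v_b = vb.get(dim, "attn.v.bias")?;
    // Stack the three projections so a single matmul produces q, k and v.
    let weight = Tensor::cat(&[&q_w, &k_w, &v_w], 0)?;
    let bias = Tensor::cat(&[&q_b, &k_b, &v_b], 0)?;
    Ok(Linear::new(weight, Some(bias)))
}
```

Whether the projections are fused or kept separate is mostly a naming/layout choice; the numerics are the same either way.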

My system specs:
CPU: Ryzen 9 5950X
RAM: 64GB

@LaurentMazare
Collaborator

Just to give a few more timings with my Ryzen 9 7950X (32GB memory), running the inference multiple times:

  • The candle code in the repo runs in 0.33s per iteration. It's weird that it's so much faster than on your box.
  • When activating the mkl feature in all candle crates (see the Cargo.toml sketch after this list), runtime goes down to 0.14s per iteration.
  • The pytorch version takes ~0.11s per iteration.

Not sure why there is so much of a discrepancy between your box and mine. Also note that the weights are mmap'ed, so the first iteration might be slower since the weights might only be copied from disk to memory at that point, though in practice I don't see much of a difference between iterations on my side.
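For anyone reproducing this: activating mkl means enabling the feature on each candle dependency in the demo's Cargo.toml, roughly like the sketch below (version numbers are illustrative; depending on the candle version, the binary crate may also need a dependency on intel-mkl-src for linking):

```toml
[dependencies]
candle-core = { version = "0.8", features = ["mkl"] }
candle-nn = { version = "0.8", features = ["mkl"] }
candle-transformers = { version = "0.8", features = ["mkl"] }
```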
