
maestro Florence-2 fine-tuning #33

Merged · 76 commits · Sep 11, 2024

Conversation

@SkalskiP (Collaborator) commented Sep 4, 2024

  • README.md update
  • maestro CLI with train and evaluate commands
  • Florence-2 fine-tuning
  • MeanAveragePrecisionMetric
  • saving best and latest checkpoints
  • tracking and saving metrics
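The "best and latest checkpoints" bullet above can be sketched as a small bookkeeping helper. The class name `CheckpointTracker` and its API are hypothetical illustrations, not maestro's actual implementation:

```python
# Hypothetical sketch: track the most recent checkpoint and the best one by a
# validation metric (e.g. mAP, where higher is better). Illustrative only.

class CheckpointTracker:
    """Keep the latest checkpoint path and the best one seen so far."""

    def __init__(self, higher_is_better: bool = True):
        self.higher_is_better = higher_is_better
        self.best_metric = float("-inf") if higher_is_better else float("inf")
        self.latest = None
        self.best = None

    def update(self, checkpoint_path: str, metric: float) -> bool:
        """Record a new checkpoint; return True if it became the new best."""
        self.latest = checkpoint_path
        improved = (metric > self.best_metric) if self.higher_is_better \
            else (metric < self.best_metric)
        if improved:
            self.best_metric = metric
            self.best = checkpoint_path
        return improved
```

In a training loop this would be called once per epoch after evaluation, saving the model weights under `latest` unconditionally and copying them to `best` only when `update` returns `True`.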

@SangbumChoi left a comment

Overall it looks well prepared. Since this is an on-going project, I left some general questions here; I'm still learning the overall pipeline and code style that the Roboflow team has established. (Luckily, it seems very similar to the transformers pipeline.)

  1. Is training.py in florence_2 missing or still in progress? (paligemma has a training.py.)
  2. Definitely consider multi-GPU setups when thinking about real user scenarios.

This is my on-going Zero-shot Object detection pipeline in HuggingFace.
huggingface/transformers#32483

num_workers=config.num_workers,
test_loaders_workers=config.val_num_workers,
)
peft_model = prepare_peft_model(
@SangbumChoi commented Sep 11, 2024

Approaching this with PEFT is also a good way to start. FYI, I have tried three different techniques:

  1. Full fine-tuning
  2. Partial fine-tuning (freezing the encoder-like part)
  3. PEFT

It turns out 2 and 3 are robust across hyperparameter options, while I couldn't find any stable configuration for 1.

@SkalskiP (Collaborator, Author) commented:

We don't have 1/2 yet. I'm just wondering how to solve 2. In theory, users might want to freeze larger or smaller parts of the graph. Do you think such flexibility would be useful, or can we just offer a pre-defined freeze?
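The flexible-freeze idea above could be a thin layer over a pre-defined default. A minimal sketch, assuming a model exposing `named_parameters()` (as `torch.nn.Module` does); `freeze_by_prefix` and the `"vision_tower"` default are illustrative names, not maestro's API:

```python
# Sketch: freeze parameters whose names start with any of the given prefixes.
# A pre-defined default covers the common case (freezing the encoder-like
# part); power users can pass their own prefixes. Names are hypothetical.

DEFAULT_FROZEN_PREFIXES = ("vision_tower",)  # assumed encoder-like submodule

def freeze_by_prefix(model, prefixes=DEFAULT_FROZEN_PREFIXES):
    """Disable gradients for matching parameters; return how many were frozen."""
    frozen = 0
    for name, param in model.named_parameters():
        if name.startswith(tuple(prefixes)):
            param.requires_grad = False
            frozen += 1
    return frozen
```

Because it only depends on `named_parameters()`, the same helper works for any HF-style model, and the "pre-defined freeze" becomes just the default argument.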

@SangbumChoi commented Sep 12, 2024

I think supporting only PEFT would be enough for now (i.e., this is not the highest priority), since the project is just starting to grow. If there is retention or other user inquiries, we can add support at that point.

@SkalskiP (Collaborator, Author) commented:

I think so too! For the time being, we have to remember that such an option may arise at some point.

# Postprocess prediction for mean average precision calculation
prediction = processor.post_process_generation(generated_text, task="<OD>", image_size=image.size)
prediction = sv.Detections.from_lmm(sv.LMM.FLORENCE_2, prediction, resolution_wh=image.size)
prediction = prediction[np.isin(prediction["class_name"], classes)]

@SangbumChoi commented:

I also agree that text-based output is one option for computing traditional OD metrics. However, some predictions are very close to the class names but np.isin will not catch them, e.g. prediction: apple, ground truth: apples.

I also considered calculating the distance between vectorized text embeddings, or other heuristic methods such as CIDEr, to make this more robust. It would be great to consider VLM metrics, e.g. CIDEr, BLEU, etc.
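A lightweight middle ground between exact `np.isin` matching and full embedding distances is fuzzy string matching with the stdlib. The helper below is a hypothetical sketch (name and cutoff are assumptions), which would catch the "apple vs. apples" case but not semantic matches like "car vs. automobile":

```python
from difflib import get_close_matches

# Sketch: map near-miss predicted class names onto the closest known class
# instead of dropping them with an exact membership check. Embedding-based
# distances or CIDEr, as suggested above, would be more robust; this only
# handles small spelling variants such as plurals.

def match_class_names(predicted, classes, cutoff=0.8):
    """Return each predicted name mapped to its closest class, or None."""
    lowered = [c.lower() for c in classes]
    mapped = []
    for name in predicted:
        hits = get_close_matches(name.lower(), lowered, n=1, cutoff=cutoff)
        mapped.append(classes[lowered.index(hits[0])] if hits else None)
    return mapped
```

Predictions mapped to `None` could then be filtered out exactly as the current `np.isin` line does, while near misses are recovered before the mAP calculation.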

@SkalskiP (Collaborator, Author) commented:

> I also agree that text-based output is one option for computing traditional OD metrics. However, some predictions are very close to the class names but np.isin will not catch them, e.g. prediction: apple, ground truth: apples.

Good catch! I've experienced that myself. I don't have the time to address it right now, but I'll add a task for it. Maybe one of the external contributors would like to implement this feature.

@SkalskiP (Collaborator, Author) commented:

> I also considered calculating the distance between vectorized text embeddings, or other heuristic methods such as CIDEr, to make this more robust. It would be great to consider VLM metrics, e.g. CIDEr, BLEU, etc.

Do you have any resources (papers) where I could read about alternative metrics?


with torch.amp.autocast(device.type, torch.float16):
lora_layers = filter(lambda p: p.requires_grad, peft_model.parameters())
optimizer = optim.SGD(lora_layers, lr=learning_rate)
scheduler = optim.lr_scheduler.CosineAnnealingLR(

@SangbumChoi commented:

Any reason for using CosineAnnealingLR?
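For reference, `torch.optim.lr_scheduler.CosineAnnealingLR` implements the cosine annealing schedule from the SGDR paper (Loshchilov & Hutter): the learning rate decays smoothly from the base LR to `eta_min` over `T_max` steps along a half cosine, avoiding the abrupt drops of step schedules. The closed form can be sketched in pure Python:

```python
import math

# The closed form behind torch.optim.lr_scheduler.CosineAnnealingLR:
#   eta_t = eta_min + (base_lr - eta_min) * (1 + cos(pi * t / T_max)) / 2
# Pure-Python sketch for intuition, not a replacement for the scheduler.

def cosine_annealing_lr(step, t_max, base_lr, eta_min=0.0):
    """Learning rate at `step` under cosine annealing over `t_max` steps."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2
```

At step 0 this returns the base LR, at `t_max` it reaches `eta_min`, and halfway through it sits exactly at the midpoint; whether that is preferable here to, say, a linear or constant schedule is an empirical question.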

@SkalskiP (Collaborator, Author) commented:

> Overall it looks well prepared. Since this is an on-going project, I left some general questions here; I'm still learning the overall pipeline and code style that the Roboflow team has established. (Luckily, it seems very similar to the transformers pipeline.)
>
>   1. Is training.py in florence_2 missing or still in progress? (paligemma has a training.py.)
>   2. Definitely consider multi-GPU setups when thinking about real user scenarios.
>
> This is my on-going Zero-shot Object detection pipeline in HuggingFace. huggingface/transformers#32483

Hi @SangbumChoi 👋🏻 First of all, thank you so much for taking the time to look at the code.

  1. Originally, we planned to deliver recipes for two foundational models - Florence-2 and PaliGemma. However, during the process we realized that PaliGemma is harder to fine-tune. The Florence-2 codebase is definitely more mature, so if you see any differences between Florence-2 and PaliGemma, you can be almost certain that we'll ultimately do it the Florence-2 way.

  2. Do you have any experience setting up training in transformers on multiple GPUs?
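For context on the multi-GPU question: the usual way to scale a transformers-style training script is to launch it under `torchrun` or Hugging Face `accelerate`, each of which spawns one process per GPU. The commands below are illustrative; `train.py` is a placeholder, not a script in this repo:

```shell
# torchrun ships with PyTorch and spawns one process per GPU on this node:
torchrun --nproc_per_node=4 train.py

# Hugging Face accelerate does the same (after running `accelerate config`):
accelerate launch --num_processes 4 train.py
```

The training code itself then needs to wrap the model for distributed data parallelism and shard the dataloader across ranks, which `accelerate` can handle with minimal code changes.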

@SkalskiP SkalskiP marked this pull request as ready for review September 11, 2024 20:32
@SkalskiP SkalskiP changed the title WIP: foundations of training maestro Florence-2 fine-tuning Sep 11, 2024
@SkalskiP SkalskiP merged commit ccd268c into develop Sep 11, 2024
1 check passed
@SangbumChoi commented Sep 12, 2024

@SkalskiP

  1. I think it is also good to have a similar codebase for PaliGemma and Florence-2 (it might not be possible, but let me brainstorm it).
  2. Yes I have, and I always use multiple GPUs for traditional OD tasks as well (never a single GPU).

Since this PR is merged, let me run the repo and continue the discussion in Slack!
