Merge pull request #33 from roboflow/feature/foundations_of_training

maestro Florence-2 fine-tuning
roboflow · Sep 11, 2024 · ccd268c
2 parents 20933c6 + 3a82c11
commit ccd268c
Showing 36 changed files with 1,921 additions and 2,031 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,21 @@
+repos:
+-   repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v2.3.0
+    hooks:
+    -   id: check-yaml
+    -   id: end-of-file-fixer
+    -   id: trailing-whitespace
+-   repo: https://github.com/psf/black
+    rev: 24.8.0
+    hooks:
+    -   id: black
+        args: [--line-length=120]
+-   repo: https://github.com/pre-commit/mirrors-mypy
+    rev: v1.11.2
+    hooks:
+    -   id: mypy
+-   repo: https://github.com/PyCQA/flake8
+    rev: 7.1.1
+    hooks:
+    -   id: flake8
+        args: [--max-line-length=120]
diff --git a/README.md b/README.md
@@ -1,142 +1,61 @@
-
 <div align="center">
 
-  <h1>multimodal-maestro</h1>
-
-  <br>
+  <h1>maestro</h1>
 
-  [![version](https://badge.fury.io/py/maestro.svg)](https://badge.fury.io/py/maestro)
-  [![license](https://img.shields.io/pypi/l/maestro)](https://github.com/roboflow/multimodal-maestro/blob/main/LICENSE)
-  [![python-version](https://img.shields.io/pypi/pyversions/maestro)](https://badge.fury.io/py/maestro)
-  [![Gradio](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Roboflow/SoM)
-  [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow/multimodal-maestro/blob/develop/cookbooks/multimodal_maestro_gpt_4_vision.ipynb)
+  <p>coming: when it's ready...</p>
 
 </div>
 
 ## 👋 hello
 
-Multimodal-Maestro gives you more control over large multimodal models to get the 
-outputs you want. With more effective prompting tactics, you can get multimodal models 
-to do tasks you didn't know (or think!) were possible. Curious how it works? Try our 
-[HF space](https://huggingface.co/spaces/Roboflow/SoM)!
+**maestro** is a tool designed to streamline and accelerate the fine-tuning process for 
+multimodal models. It provides ready-to-use recipes for fine-tuning popular 
+vision-language models (VLMs) such as **Florence-2**, **PaliGemma**, and 
+**Phi-3.5 Vision** on downstream vision-language tasks.
 
 ## 💻 install
 
-⚠️ Our package has been renamed to `maestro`. Install the package in a
-[**3.11>=Python>=3.8**](https://www.python.org/) environment.
+Install the `maestro` package in a
+[**Python>=3.8**](https://www.python.org/) environment.
 
 ```bash
 pip install maestro
 ```
 
-## 🔌 API
+## 🔥 quickstart
 
-🚧 The project is still under construction. The redesigned API is coming soon.
+### CLI
 
-![maestro-docs-Snap](https://github.com/roboflow/multimodal-maestro/assets/26109316/a787b7c0-527e-465a-9ca9-d46f4d63ea53)
+VLMs can be fine-tuned on downstream tasks directly from the command line with the
+`maestro` command:
 
-## 🧑‍🍳 prompting cookbooks
+```bash
+maestro florence2 train --dataset='<DATASET_PATH>' --epochs=10 --batch-size=8
+```
 
-| Description                                                     | Colab                                                                                                                                                                                                   |
-|:----------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
-| Prompt LMMs with Multimodal Maestro                             | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow/multimodal-maestro/blob/develop/cookbooks/multimodal_maestro_gpt_4_vision.ipynb) |
-| Manually annotate ONE image and let GPT-4V annotate ALL of them | [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/roboflow/multimodal-maestro/blob/develop/cookbooks/grounding_dino_and_gpt4_vision.ipynb)  |
+### SDK
 
+Alternatively, you can fine-tune VLMs using the Python SDK, which accepts the same 
+arguments as the CLI example above:
 
-## 🚀 example
+```python
+from maestro.trainer.common import MeanAveragePrecisionMetric
+from maestro.trainer.models.florence_2 import train, TrainingConfiguration
 
-```
-Find dog.
+config = TrainingConfiguration(
+    dataset='<DATASET_PATH>',
+    epochs=10,
+    batch_size=8,
+    metrics=[MeanAveragePrecisionMetric()]
+)
 
->>> The dog is prominently featured in the center of the image with the label [9].
+train(config)
 ```
 
-<details close>
-<summary>👉 read more</summary>
-
-<br>
-
-- **load image**
-
-  ```python
-  import cv2
-
-  image = cv2.imread("...")
-  ```
-
-- **create and refine marks**
-
-  ```python
-  import maestro
-
-  generator = maestro.SegmentAnythingMarkGenerator(device='cuda')
-  marks = generator.generate(image=image)
-  marks = maestro.refine_marks(marks=marks)
-  ```
-
-- **visualize marks**
-
-  ```python
-  mark_visualizer = maestro.MarkVisualizer()
-  marked_image = mark_visualizer.visualize(image=image, marks=marks)
-  ```
-  ![image-vs-marked-image](https://github.com/roboflow/multimodal-maestro/assets/26109316/92951ed2-65c0-475a-9279-6fd344757092)
-
-- **prompt**
-
-  ```python
-  prompt = "Find dog."
-
-  response = maestro.prompt_image(api_key=api_key, image=marked_image, prompt=prompt)
-  ```
-
-  ```
-  >>> "The dog is prominently featured in the center of the image with the label [9]."
-  ```
-
-- **extract related marks**
-
-  ```python
-  masks = maestro.extract_relevant_masks(text=response, detections=refined_marks)
-  ```
-
-  ```
-  >>> {'6': array([
-  ...     [False, False, False, ..., False, False, False],
-  ...     [False, False, False, ..., False, False, False],
-  ...     [False, False, False, ..., False, False, False],
-  ...     ...,
-  ...     [ True,  True,  True, ..., False, False, False],
-  ...     [ True,  True,  True, ..., False, False, False],
-  ...     [ True,  True,  True, ..., False, False, False]])
-  ... }
-  ```
-
-</details>
-
-![multimodal-maestro](https://github.com/roboflow/multimodal-maestro/assets/26109316/c04f2b18-2a1d-4535-9582-e5d3ec0a926e)
-
-## 🚧 roadmap
-
-- [ ] Rewriting the `maestro` API.
-- [ ] Update [HF space](https://huggingface.co/spaces/Roboflow/SoM).
-- [ ] Documentation page.
-- [ ] Add GroundingDINO prompting strategy.
-- [ ] CovVLM demo.
-- [ ] Qwen-VL demo.
-
-## 💜 acknowledgement
-
-- [Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding
-in GPT-4V](https://arxiv.org/abs/2310.11441) by Jianwei Yang, Hao Zhang, Feng Li, Xueyan
-Zou, Chunyuan Li, Jianfeng Gao.
-- [The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)](https://arxiv.org/abs/2309.17421)
-by Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, 
-Lijuan Wang
-
 ## 🦸 contribution
 
-We would love your help in making this repository even better! If you noticed any bug, 
-or if you have any suggestions for improvement, feel free to open an 
+We would love your help in making this repository even better! We are especially 
+looking for contributors with experience in fine-tuning vision-language models (VLMs). 
+If you notice any bugs or have suggestions for improvement, feel free to open an 
 [issue](https://github.com/roboflow/multimodal-maestro/issues) or submit a 
 [pull request](https://github.com/roboflow/multimodal-maestro/pulls).
diff --git a/maestro/cli/__init__.py b/maestro/cli/__init__.py
@@ -0,0 +1 @@
+
diff --git a/maestro/cli/env.py b/maestro/cli/env.py
@@ -0,0 +1,2 @@
+DISABLE_RECIPE_IMPORTS_WARNINGS_ENV = "DISABLE_RECIPE_IMPORTS_WARNINGS"
+DEFAULT_DISABLE_RECIPE_IMPORTS_WARNINGS_ENV = "False"
diff --git a/maestro/cli/introspection.py b/maestro/cli/introspection.py
@@ -0,0 +1,37 @@
+import os
+
+import typer
+
+from maestro.cli.env import DISABLE_RECIPE_IMPORTS_WARNINGS_ENV, \
+    DEFAULT_DISABLE_RECIPE_IMPORTS_WARNINGS_ENV
+from maestro.cli.utils import str2bool
+
+
+def find_training_recipes(app: typer.Typer) -> None:
+    try:
+        from maestro.trainer.models.florence_2.entrypoint import florence_2_app
+
+        app.add_typer(florence_2_app, name="florence2")
+    except Exception:
+        _warn_about_recipe_import_error(model_name="Florence 2")
+
+    try:
+        from maestro.trainer.models.paligemma.entrypoint import paligemma_app
+
+        app.add_typer(paligemma_app, name="paligemma")
+    except Exception:
+        _warn_about_recipe_import_error(model_name="PaliGemma")
+
+
+def _warn_about_recipe_import_error(model_name: str) -> None:
+    disable_warnings = str2bool(
+        os.getenv(
+            DISABLE_RECIPE_IMPORTS_WARNINGS_ENV,
+            DEFAULT_DISABLE_RECIPE_IMPORTS_WARNINGS_ENV,
+        )
+    )
+    if disable_warnings:
+        return None
+    warning = typer.style("WARNING", fg=typer.colors.RED, bold=True)
+    message = "🚧 " + warning + f" cannot import recipe for {model_name}"
+    typer.echo(message)
diff --git a/maestro/cli/main.py b/maestro/cli/main.py
@@ -0,0 +1,15 @@
+import typer
+
+from maestro.cli.introspection import find_training_recipes
+
+app = typer.Typer()
+find_training_recipes(app=app)
+
+
+@app.command(help="Display information about maestro")
+def info():
+    typer.echo("Welcome to maestro CLI. Let's train some VLMs! 🏋")
+
+
+if __name__ == "__main__":
+    app()
diff --git a/maestro/cli/utils.py b/maestro/cli/utils.py
@@ -0,0 +1,2 @@
+def str2bool(value: str) -> bool:
+    return value.lower() in {"y", "t", "yes", "true"}
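
A small sketch of how `str2bool` interacts with the `DISABLE_RECIPE_IMPORTS_WARNINGS` variable read in `maestro/cli/introspection.py`. The function and constant names are copied from the diff above; the `warnings_disabled` helper is hypothetical and exists only for illustration.

```python
import os

DISABLE_RECIPE_IMPORTS_WARNINGS_ENV = "DISABLE_RECIPE_IMPORTS_WARNINGS"
DEFAULT_DISABLE_RECIPE_IMPORTS_WARNINGS_ENV = "False"


def str2bool(value: str) -> bool:
    # Same logic as maestro/cli/utils.py: case-insensitive match against a fixed set.
    return value.lower() in {"y", "t", "yes", "true"}


def warnings_disabled() -> bool:
    # Hypothetical helper mirroring the lookup done in _warn_about_recipe_import_error.
    raw = os.getenv(
        DISABLE_RECIPE_IMPORTS_WARNINGS_ENV,
        DEFAULT_DISABLE_RECIPE_IMPORTS_WARNINGS_ENV,
    )
    return str2bool(raw)


os.environ[DISABLE_RECIPE_IMPORTS_WARNINGS_ENV] = "TRUE"
print(warnings_disabled())  # True

os.environ[DISABLE_RECIPE_IMPORTS_WARNINGS_ENV] = "1"
print(warnings_disabled())  # False, because "1" is not in the accepted set
```

Values such as `1` or `on` are treated as false by this parser, so the variable has to be set to one of `y`, `t`, `yes`, or `true` (in any letter case) to silence the import warnings.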
diff --git a/maestro/trainer/__init__.py b/maestro/trainer/__init__.py
diff --git a/maestro/trainer/common/__init__.py b/maestro/trainer/common/__init__.py
@@ -0,0 +1 @@
+from maestro.trainer.common.utils.metrics import MeanAveragePrecisionMetric
diff --git a/maestro/trainer/common/configuration/__init__.py b/maestro/trainer/common/configuration/__init__.py
diff --git a/maestro/trainer/common/configuration/env.py b/maestro/trainer/common/configuration/env.py
@@ -0,0 +1,5 @@
+SEED_ENV = "SEED"
+DEFAULT_SEED = "42"
+CUDA_DEVICE_ENV = "CUDA_DEVICE"
+DEFAULT_CUDA_DEVICE = "cuda:0"
+HF_TOKEN_ENV = "HF_TOKEN"
diff --git a/maestro/trainer/common/data_loaders/__init__.py b/maestro/trainer/common/data_loaders/__init__.py
diff --git a/maestro/trainer/common/data_loaders/datasets.py b/maestro/trainer/common/data_loaders/datasets.py
@@ -0,0 +1,50 @@
+import json
+import os
+from typing import List, Dict, Any, Tuple
+
+from PIL import Image
+from torch.utils.data import Dataset
+
+
+class JSONLDataset:
+    def __init__(self, jsonl_file_path: str, image_directory_path: str):
+        self.jsonl_file_path = jsonl_file_path
+        self.image_directory_path = image_directory_path
+        self.entries = self._load_entries()
+
+    def _load_entries(self) -> List[Dict[str, Any]]:
+        entries = []
+        with open(self.jsonl_file_path, "r") as file:
+            for line in file:
+                data = json.loads(line)
+                entries.append(data)
+        return entries
+
+    def __len__(self) -> int:
+        return len(self.entries)
+
+    def __getitem__(self, idx: int) -> Tuple[Image.Image, Dict[str, Any]]:
+        if idx < 0 or idx >= len(self.entries):
+            raise IndexError("Index out of range")
+
+        entry = self.entries[idx]
+        image_path = os.path.join(self.image_directory_path, entry["image"])
+        try:
+            image = Image.open(image_path)
+            return (image, entry)
+        except FileNotFoundError:
+            raise FileNotFoundError(f"Image file {image_path} not found.")
+
+
+class DetectionDataset(Dataset):
+    def __init__(self, jsonl_file_path: str, image_directory_path: str):
+        self.dataset = JSONLDataset(jsonl_file_path, image_directory_path)
+
+    def __len__(self):
+        return len(self.dataset)
+
+    def __getitem__(self, idx):
+        image, data = self.dataset[idx]
+        prefix = data["prefix"]
+        suffix = data["suffix"]
+        return prefix, suffix, image
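
`JSONLDataset` reads one JSON object per line, and `DetectionDataset` then pulls the `image`, `prefix`, and `suffix` keys out of each entry. The sketch below writes and parses a tiny annotation file using only the standard library. Only the three key names come from the diff; the file names, the placeholder values, and the `load_entries` helper are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, List

# Placeholder entries: only the key names ("image", "prefix", "suffix") come from the diff.
entries = [
    {"image": "0001.jpg", "prefix": "<PROMPT>", "suffix": "<EXPECTED_OUTPUT_1>"},
    {"image": "0002.jpg", "prefix": "<PROMPT>", "suffix": "<EXPECTED_OUTPUT_2>"},
]


def load_entries(jsonl_file_path: str) -> List[Dict[str, Any]]:
    # Hypothetical stand-in for JSONLDataset._load_entries: one json.loads per line.
    with open(jsonl_file_path, "r", encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]


with tempfile.TemporaryDirectory() as directory:
    annotations_path = os.path.join(directory, "annotations.jsonl")
    with open(annotations_path, "w", encoding="utf-8") as file:
        for entry in entries:
            file.write(json.dumps(entry) + "\n")

    loaded = load_entries(annotations_path)
    # Same field order that DetectionDataset.__getitem__ returns, with the
    # image path in place of the opened PIL image.
    samples = [
        (item["prefix"], item["suffix"], os.path.join(directory, item["image"]))
        for item in loaded
    ]

print(len(samples))
print(samples[0][0], samples[0][1])
```

An entry missing any of the three keys would raise `KeyError` in `DetectionDataset.__getitem__`, so it may be worth validating annotation files before training.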
diff --git a/maestro/trainer/common/data_loaders/jsonl.py b/maestro/trainer/common/data_loaders/jsonl.py
@@ -0,0 +1,31 @@
+from __future__ import annotations
+
+import random
+from typing import List
+
+from torch.utils.data import Dataset
+
+from maestro.trainer.common.utils.file_system import read_jsonl
+
+
+class JSONLDataset(Dataset):
+    # TODO: implementation could be better - avoiding loading
+    #  whole files to memory
+
+    @classmethod
+    def from_jsonl_file(cls, path: str) -> JSONLDataset:
+        file_content = read_jsonl(path=path)
+        random.shuffle(file_content)
+        return cls(jsons=file_content)
+
+    def __init__(self, jsons: List[dict]):
+        self.jsons = jsons
+
+    def __getitem__(self, index):
+        return self.jsons[index]
+
+    def __len__(self):
+        return len(self.jsons)
+
+    def shuffle(self):
+        random.shuffle(self.jsons)
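
The TODO in `JSONLDataset` mentions avoiding loading whole files into memory. One possible approach, sketched here under assumptions, is to index the byte offset of each line once and parse records on demand. `LazyJSONLDataset` is a hypothetical class, not part of this PR, and it deliberately does not subclass `torch.utils.data.Dataset` so that the example runs with the standard library alone.

```python
import json
import os
import random
import tempfile
from typing import List


class LazyJSONLDataset:
    """Hypothetical variant: keeps only byte offsets in memory, parses lines on demand."""

    def __init__(self, path: str):
        self.path = path
        self.offsets: List[int] = []
        offset = 0
        with open(path, "rb") as file:
            for line in file:
                if line.strip():  # skip blank lines
                    self.offsets.append(offset)
                offset += len(line)

    def __len__(self) -> int:
        return len(self.offsets)

    def __getitem__(self, index: int) -> dict:
        with open(self.path, "rb") as file:
            file.seek(self.offsets[index])
            return json.loads(file.readline().decode("utf-8"))

    def shuffle(self) -> None:
        # Shuffling the offsets reorders the records without touching the file.
        random.shuffle(self.offsets)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "records.jsonl")
    with open(path, "w", encoding="utf-8") as file:
        for i in range(3):
            file.write(json.dumps({"id": i}) + "\n")

    dataset = LazyJSONLDataset(path)
    first = dataset[0]
    last = dataset[-1]
    size = len(dataset)

print(size, first, last)
```

The trade-off is one file open and seek per item, which is slower per access than the in-memory list but keeps memory use proportional to the number of records rather than their size.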
diff --git a/maestro/trainer/common/utils/__init__.py b/maestro/trainer/common/utils/__init__.py