[jsk_perception] Add Visual Grounding with OFA #2797

Open
wants to merge 3 commits into
base: master
89 changes: 89 additions & 0 deletions doc/jsk_perception/nodes/detection_node.md
@@ -0,0 +1,89 @@
# detection_node.py

![](images/dino.png)

The ROS node for Open-Vocabulary Object Detection with GroundingDINO.

## System Configuration
![](images/large_scale_vil_system.png)

This node works together with a Docker container that runs the inference. Build the container first by following the setup instructions below.

### Prerequisite
This node requires an NVIDIA GPU with more than 4 GB of GPU memory to work properly.
You have to install nvidia-container-toolkit to use the GPU from Docker. Please follow the [official instructions](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).

### Build the docker image
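Build the Docker image for GroundingDINO. Running `make` without a target builds all images (`ofa`, `clip` and `dino`); `make dino` builds only the GroundingDINO image.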

```shell
roscd jsk_perception/docker
make
```

## Subscribing topic
* `~image` (`sensor_msgs/Image`)

Input image

## Publishing topic
* `~output/image` (`sensor_msgs/Image`)

Image with the detected bounding boxes drawn on it

* `~rects` (`jsk_recognition_msgs/RectArray`)

Array of detected bounding box regions

* `~result` (`jsk_recognition_msgs/DetectionResult`)

Detection result

* `~result/image` (`sensor_msgs/Image`)

Images used for inference

* `~visualize` (`std_msgs/String`)

Detection result to visualize

## Action topic
* `~inference_server/goal` (`jsk_recognition_msgs/DetectionTaskActionGoal`)

Detection request with custom categories and image

* `~inference_server/result` (`jsk_recognition_msgs/DetectionTaskActionResult`)

Detection result of `~inference_server/goal`

## Parameters
* `~host` (String, default: `localhost`)

The host name or IP address of the inference container

* `~port` (Integer, default: `8080`)

The HTTP port of the inference container

## Dynamic Reconfigure Parameters
* `~queries` (String, default: `human;kettle;cup;glass`)

Semicolon-separated categories to detect in images received on the subscribed `~image` topic.
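
The following is a minimal sketch of how a `~queries` value might become the text prompt passed to GroundingDINO. Splitting on `;` is an assumption inferred from the default value, since the node source is not shown here. The ` . ` separator mirrors `convert_to_string` in `docker/dino/server.py`. The function names `split_queries` and `to_prompt` are hypothetical.

```python
def split_queries(queries):
    """Split a semicolon-separated query string into category names.

    Assumption: the node separates categories with ';', as the default
    value 'human;kettle;cup;glass' suggests.
    """
    return [q.strip() for q in queries.split(";") if q.strip()]


def to_prompt(categories):
    """Join categories the same way convert_to_string does in server.py."""
    if not categories:
        return ""
    return " . ".join(categories) + " ."


if __name__ == "__main__":
    print(to_prompt(split_queries("human;kettle;cup;glass")))
    # prints: human . kettle . cup . glass .
```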

### Run inference container on another host or another terminal
On the remote GPU machine:
```shell
cd jsk_recognition/jsk_perception/docker
./run_jsk_vil_api dino --port (Your vacant port)
```

On the ROS machine:
```shell
roslaunch jsk_perception detection.launch port:=(Your inference container port) host:=(Your inference container host) DETECTION_INPUT_IMAGE:=(Your image topic name) gui:=true
```


### Run both the inference container and the ROS node on a single host
```shell
roslaunch jsk_perception detection.launch run_api:=true DETECTION_INPUT_IMAGE:=(Your image topic name) gui:=true
```
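
For debugging without ROS, the inference container can also be queried directly over HTTP. The sketch below assumes the request format implemented by `docker/dino/server.py` in this pull request: a JSON body posted to `/detection`, with a base64-encoded image file under `image` and a list of category names under `queries`. The helper names `build_detection_request` and `request_detection` are illustrative and not part of jsk_perception, and the default host and port are placeholders.

```python
import base64
import json
import urllib.request


def build_detection_request(image_bytes, queries):
    """Build the JSON body expected by the /detection endpoint.

    image_bytes must be an encoded image file (e.g. JPEG or PNG),
    because the server decodes it with cv2.imdecode.
    """
    payload = {
        "image": base64.b64encode(image_bytes).decode("utf-8"),
        "queries": list(queries),
    }
    return json.dumps(payload).encode("utf-8")


def request_detection(image_bytes, queries, host="localhost", port=8080):
    """POST an image and queries; return the list under 'results'."""
    req = urllib.request.Request(
        "http://{}:{}/detection".format(host, port),
        data=build_detection_request(image_bytes, queries),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as res:
        return json.loads(res.read().decode("utf-8"))["results"]
```

With a running container, `request_detection(open("sample.jpg", "rb").read(), ["cup", "kettle"])` would return a list of dictionaries with the keys `id`, `box`, `logit` and `phrase`, where `sample.jpg` stands for any local image file.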
Binary file added doc/jsk_perception/nodes/images/dino.png
9 changes: 7 additions & 2 deletions jsk_perception/docker/Makefile
@@ -5,9 +5,11 @@
# api directories
OFAPROJECT = ofa
CLIPPROJECT = clip
DINOPROJECT = dino
# image names
OFAIMAGE = jsk-ofa-server
CLIPIMAGE = jsk-clip-server
DINOIMAGE = jsk-dino-server
# commands
BUILDIMAGE = docker build
REMOVEIMAGE = docker rmi
@@ -23,7 +25,7 @@ PARAMURLS = parameter_urls.txt
# OFA parameters
OFAPARAMFILES = $(foreach param, $(OFAPARAMS), $(PARAMDIR)/$(param))

all: ofa clip
all: ofa clip dino

# TODO check command wget exists, nvidia-driver version

@@ -41,11 +43,14 @@ ofa: $(PARAMDIR)/.download
clip: $(PARAMDIR)/.download
	$(BUILDIMAGE) $(CLIPPROJECT) -t $(CLIPIMAGE) -f $(CLIPPROJECT)/Dockerfile

dino: $(PARAMDIR)/.download
	$(BUILDIMAGE) $(DINOPROJECT) -t $(DINOIMAGE) -f $(DINOPROJECT)/Dockerfile

# TODO add clip, glip
clean:
	@$(REMOVEIMAGE) $(OFAIMAGE)

wipe: clean
	rm -fr $(PARAMDIR)

.PHONY: clean wipe ofa clip
.PHONY: clean wipe ofa clip dino
27 changes: 27 additions & 0 deletions jsk_perception/docker/dino/Dockerfile
@@ -0,0 +1,27 @@
# FROM pytorch/pytorch:1.7.1-cuda11.0-cudnn8-devel
FROM pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel
# FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
ARG DEBIAN_FRONTEND=noninteractive
RUN apt -o Acquire::AllowInsecureRepositories=true update \
    && apt-get install -y \
    curl \
    git \
    libopencv-dev \
    wget \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
ENV CUDA_HOME /usr/local/cuda
ENV TORCH_CUDA_ARCH_LIST 8.0+PTX
RUN git clone https://github.com/IDEA-Research/GroundingDINO.git
RUN echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
RUN echo 'export TORCH_CUDA_ARCH_LIST=8.0+PTX' >> ~/.bashrc
RUN pip install flask opencv-python \
    && pip install "numpy>=1.20"
RUN cd GroundingDINO \
    && pip install -r requirements.txt \
    && pip install -e .
RUN mkdir -p GroundingDINO/weights \
    && cd GroundingDINO/weights \
    && wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
COPY server.py /workspace/GroundingDINO
ENTRYPOINT cd /workspace/GroundingDINO && python server.py
99 changes: 99 additions & 0 deletions jsk_perception/docker/dino/server.py
@@ -0,0 +1,99 @@
from groundingdino.util.inference import load_model, load_image, predict, annotate
import groundingdino.datasets.transforms as T
from torchvision.ops import box_convert

import cv2
import numpy as np
from PIL import Image as PLImage
import torch

# web server
from flask import Flask, request, Response
import json
import base64


def apply_half(t):
    if t.dtype is torch.float32:
        return t.to(dtype=torch.half)
    return t

class Inference:
    def __init__(self, gpu_id=None):
        self.gpu_id = gpu_id
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py", "weights/groundingdino_swint_ogc.pth")
        self.BOX_TRESHOLD = 0.35
        self.TEXT_TRESHOLD = 0.25

    def convert_to_string(self, input_list):
        output_string = ""
        for item in input_list:
            output_string += item + " . "
        return output_string.strip()

    def infer(self, img, texts):
        # get cv2 image
        # image = cv2.resize(img, dsize=(640, 480)) # NOTE forcely
        # image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        image_source = PLImage.fromarray(image)
        image = np.asarray(image_source)
        transform = T.Compose(
            [
                T.RandomResize([800], max_size=1333),
                T.ToTensor(),
                T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            ]
        )
        image_transformed, _ = transform(image_source, None)

        image_source = image
        image = image_transformed

        TEXT_PROMPT = self.convert_to_string(texts)

        boxes, logits, phrases = predict(
            model=self.model,
            image=image,
            caption=TEXT_PROMPT,
            box_threshold=self.BOX_TRESHOLD,
            text_threshold=self.TEXT_TRESHOLD,
            device=self.device
        )

        h, w, _ = image_source.shape
        boxes = boxes * torch.Tensor([w, h, w, h])
        xyxy = box_convert(boxes=boxes, in_fmt="cxcywh", out_fmt="xyxy").numpy()

        results = {}
        for i in range(len(xyxy)):
            box = xyxy[i].tolist()
            logit = logits[i].item()
            results[i] = {"box": box, "logit": logit, "phrase": phrases[i]}

        return results

# run
if __name__ == "__main__":
    app = Flask(__name__)
    infer = Inference()

    @app.route("/detection", methods=['POST'])
    def detection_request():
        data = request.data.decode("utf-8")
        data_json = json.loads(data)
        # process image
        image_b = data_json['image']
        image_dec = base64.b64decode(image_b)
        data_np = np.frombuffer(image_dec, dtype='uint8')
        img = cv2.imdecode(data_np, 1)
        # get text
        texts = data_json['queries']
        infer_results = infer.infer(img, texts)
        results = []
        for i in range(len(infer_results)):
            results.append({"id": i, "box": infer_results[i]["box"], "logit": infer_results[i]["logit"], "phrase": infer_results[i]["phrase"]})
        return Response(response=json.dumps({"results": results}), status=200)

    app.run("0.0.0.0", 8080, threaded=True)
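
Note on the box format in `infer`: `predict` returns boxes as normalized `(cx, cy, w, h)`, which the code scales by the image size and converts to pixel `(x1, y1, x2, y2)` with `box_convert`. The same arithmetic can be sketched in plain Python as follows. The function name `cxcywh_to_xyxy_pixels` is hypothetical and is only meant to illustrate the conversion, not to replace the torchvision call.

```python
def cxcywh_to_xyxy_pixels(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    # Scale the normalized values to pixel units first.
    cx, w = cx * width, w * width
    cy, h = cy * height, h * height
    # The corners are half a box size away from the center.
    return [cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0]


if __name__ == "__main__":
    # A box centered in a 640x480 image that covers half of each dimension.
    print(cxcywh_to_xyxy_pixels((0.5, 0.5, 0.5, 0.5), 640, 480))
    # prints: [160.0, 120.0, 480.0, 360.0]
```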
58 changes: 54 additions & 4 deletions jsk_perception/docker/ofa/server.py
@@ -61,7 +61,7 @@ def __init__(self, task, model_scale):
utils.split_paths(param_path),
arg_overrides=overrides)
elif task == "refcoco":
tasks.register_task(self.task, RefcocoTask)
tasks.register_task(task, RefcocoTask)
self.models, self.cfg, self.task = checkpoint_utils.load_model_ensemble_and_task(
utils.split_paths(param_path),
arg_overrides=overrides)
@@ -140,6 +140,15 @@ def encode_text(self, text, length=None, append_bos=False, append_eos=False):
s = torch.cat([s, eos_item])
return s

def convert_objects_to_text(self, text):
if len(text) == 1:
object_text = text[0]
elif len(text) >= 2:
object_text = ', '.join(text[:-1]) + f' or {text[-1]}'
else:
object_text = ''
return object_text

def construct_sample(self, image, text):
if self.task_name == "caption" or self.task_name == "vqa_gen":
patch_image = self.patch_resize_transform(image).unsqueeze(0)
@@ -176,7 +185,8 @@ def construct_sample(self, image, text):
h_resize_ratio = torch.tensor(patch_image_size / h).unsqueeze(0)
patch_image = self.patch_resize_transform(image).unsqueeze(0)
patch_mask = torch.tensor([True])
src_text = self.encode_text(' which region does the text " {} " describe?'.format(text), append_bos=True,
object_text = self.convert_objects_to_text(text)
src_text = self.encode_text(' which region does the text " {} " describe?'.format(object_text), append_bos=True,
append_eos=True).unsqueeze(0)
src_length = torch.LongTensor([s.ne(self.pad_idx).long().sum() for s in src_text])
sample = {
@@ -214,7 +224,24 @@ def infer(self, img, text):
text = result[0]['answer']
return text
elif self.task_name == "refcoco":
pass
# image = cv2.resize(img, dsize=(640, 480)) # NOTE forcely
# image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
image = Image.fromarray(image)
# Construct input sample & preprocess for GPU if cuda available for VG
sample = self.construct_sample(image, text)
sample = utils.move_to_cuda(sample) if self.use_cuda else sample
sample = utils.apply_to_sample(apply_half, sample) if self.use_fp16 else sample
with torch.no_grad():
result, scores = eval_step(self.task, self.generator, self.models, sample)
results = {}
object_text = self.convert_objects_to_text(text)
for i in range(len(result)):
box = result[i]["box"]
logit = scores[i].item()
results[i] = {"box": box, "logit": logit, "phrase": object_text}

return results

# run
if __name__ == "__main__":
@@ -232,6 +259,9 @@ def infer(self, img, text):
elif ofa_task == "vqa_gen":
vqa_infer = Inference("vqa_gen", ofa_model_scale)

elif ofa_task == "detection":
detection_infer = Inference("refcoco", ofa_model_scale)

else:
raise RuntimeError("No application is available")

@@ -274,5 +304,25 @@ def vqa_request():
return Response(response=json.dumps({"results": results}), status=200)
except NameError:
print("Skipping create vqa_gen app")


try:
@app.route("/detection", methods=['POST'])
def detection_request():
data = request.data.decode("utf-8")
data_json = json.loads(data)
# process image
image_b = data_json['image']
image_dec = base64.b64decode(image_b)
data_np = np.frombuffer(image_dec, dtype='uint8')
img = cv2.imdecode(data_np, 1)
# get text
texts = data_json['queries']
infer_results = detection_infer.infer(img, texts)
results = []
for i in range(len(infer_results)):
results.append({"id": i, "box": infer_results[i]["box"], "logit": infer_results[i]["logit"], "phrase": infer_results[i]["phrase"]})
return Response(response=json.dumps({"results": results}), status=200)
except NameError:
print("Skipping create detection app")

app.run("0.0.0.0", 8080, threaded=True)
3 changes: 2 additions & 1 deletion jsk_perception/docker/run_jsk_vil_api
@@ -10,7 +10,8 @@ import subprocess
import sys

CONTAINERS = {"ofa": "jsk-ofa-server",
"clip": "jsk-clip-server"}
"clip": "jsk-clip-server",
"dino": "jsk-dino-server"}
OFA_MODEL_SCALES = ["base", "large", "huge"]

parser = argparse.ArgumentParser(description="JSK Vision and Language API runner")
24 changes: 24 additions & 0 deletions jsk_perception/launch/detection.launch
@@ -0,0 +1,24 @@
<?xml version="1.0" encoding="utf-8"?>
<launch>
  <arg name="host" default="localhost" />
  <arg name="port" default="8888" />
  <arg name="gui" default="false" />
  <arg name="run_api" default="false" />
  <arg name="model" default="dino" />
  <arg name="DETECTION_INPUT_IMAGE" default="image" />

  <node name="detection_api" pkg="jsk_perception" type="run_jsk_vil_api" output="log"
        args="$(arg model) -p $(arg port)" if="$(arg run_api)" />

  <node name="detection" pkg="jsk_perception" type="detection_node.py" output="screen">
    <remap from="~image" to="$(arg DETECTION_INPUT_IMAGE)" />
    <rosparam subst_value="true">
      host: $(arg host)
      port: $(arg port)
      model: $(arg model)
    </rosparam>
  </node>

  <include file="$(find jsk_perception)/launch/ofa_gui.launch" if="$(arg gui)" />

</launch>