ActFormer: Scalable Collaborative Perception via Active Queries

Suozhi Huang*2, Juexiao Zhang*1, Yiming Li*1, Chen Feng1

1New York University
2Tsinghua University; work done during an internship at NYU
Abstract

Collaborative perception leverages rich visual observations from multiple robots to extend a single robot's perception ability beyond its field of view. Many prior works receive messages broadcast from all collaborators, leading to a scalability challenge when dealing with a large number of robots and sensors. In this work, we aim to address scalable camera-based collaborative perception with a Transformer-based architecture. Our key idea is to enable a single robot to intelligently discern the relevance of the collaborators and their associated cameras according to a learned spatial prior. This proactive understanding of the visual features' relevance does not require the transmission of the features themselves, enhancing both communication and computation efficiency. Specifically, we present ActFormer, a Transformer that learns bird's eye view (BEV) representations by using predefined BEV queries to interact with multi-robot multi-camera inputs. Each BEV query can actively select relevant cameras for information aggregation based on pose information, instead of interacting with all cameras indiscriminately. Experiments on the V2X-Sim dataset demonstrate that ActFormer improves the detection performance from 29.89% to 45.15% in terms of AP@0.7 with about 50% fewer queries, showcasing the effectiveness of ActFormer in multi-agent collaborative 3D object detection.
Contribution

  • We conceptualize a scalable and efficient collaborative perception framework that can actively and intelligently identify the most relevant sensory measurements based on spatial knowledge, without relying on the sensory measurements themselves.
  • We ground this concept of scalable collaborative perception with a Transformer, i.e., ActFormer, which uses a group of 3D-to-2D BEV queries to actively and efficiently aggregate features from multi-robot multi-camera input, relying only on pose information (a minimal sketch of such queries follows this list).
  • We conduct comprehensive experiments on the task of collaborative object detection to verify the effectiveness and efficiency of our ActFormer.
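As a concrete illustration of the second point above, the following minimal sketch (PyTorch; the grid size, spatial ranges, and feature dimension are hypothetical, not the paper's released configuration) shows what a predefined group of 3D-to-2D BEV queries can look like: each query is a learnable feature vector anchored at a fixed 3D reference point in the ego frame, and it is these reference points, together with camera poses, that drive the projection into 2D views.

import torch

class BEVQueries(torch.nn.Module):
    """A predefined grid of BEV queries: one learnable embedding per BEV cell,
    each tied to a fixed 3D reference point in the ego frame that is later
    projected into 2D camera views."""

    def __init__(self, grid_h=50, grid_w=50, dim=256,
                 x_range=(-32.0, 32.0), y_range=(-32.0, 32.0), z=0.0):
        super().__init__()
        # hypothetical grid resolution and perception range
        self.embed = torch.nn.Embedding(grid_h * grid_w, dim)
        xs = torch.linspace(x_range[0], x_range[1], grid_h)
        ys = torch.linspace(y_range[0], y_range[1], grid_w)
        xv, yv = torch.meshgrid(xs, ys, indexing="ij")
        ref = torch.stack(
            [xv.flatten(), yv.flatten(), torch.full((grid_h * grid_w,), z)], dim=-1)
        self.register_buffer("ref_points", ref)  # (N, 3) anchors on the ego ground plane

    def forward(self):
        # (N, dim) query features and (N, 3) fixed reference points
        return self.embed.weight, self.ref_points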
Method

Our motivation stems from the idea that how vehicles collaboratively perceive should be closely related to their relative poses. Different camera poses result in different viewpoints, each capturing unique information. However, conventional collaborative methods often treat all viewpoints equally, overlooking the fact that these camera perspectives offer different insights into the environment: some unique, some overlapping, and some redundant. Consequently, the ego vehicle may not fully capitalize on the diverse perspectives available, leading to indiscriminate collaboration that incurs excessive communication and computation. In fact, communication may be unnecessary when some partners share very similar observations.
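The sketch below makes this intuition concrete. Using only camera poses and intrinsics, and no image features at all, the ego vehicle can compute for every BEV query which collaborator cameras actually observe that query's reference point. The coordinate conventions, near-plane threshold, and image size here are illustrative assumptions rather than the paper's exact procedure.

import numpy as np

def camera_relevance_mask(ref_points_ego,   # (N, 3) BEV query anchors in the ego frame
                          T_ego_to_world,   # (4, 4) ego pose
                          T_world_to_cams,  # list of (4, 4) extrinsics of all collaborators' cameras
                          cam_intrinsics,   # list of (3, 3) intrinsics, same order
                          image_hw=(480, 640)):
    """Boolean (N, num_cams) mask: True where a camera can see a query's reference point."""
    n = len(ref_points_ego)
    pts_h = np.concatenate([ref_points_ego, np.ones((n, 1))], axis=-1)  # homogeneous (N, 4)
    pts_world = (T_ego_to_world @ pts_h.T).T                            # ego frame -> world frame
    h, w = image_hw
    mask = np.zeros((n, len(T_world_to_cams)), dtype=bool)
    for j, (T_wc, K) in enumerate(zip(T_world_to_cams, cam_intrinsics)):
        pts_cam = (T_wc @ pts_world.T).T[:, :3]                         # world frame -> camera frame
        in_front = pts_cam[:, 2] > 0.1                                  # drop points behind the camera
        uvw = (K @ pts_cam.T).T
        uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)              # perspective divide
        mask[:, j] = (in_front
                      & (uv[:, 0] >= 0) & (uv[:, 0] < w)
                      & (uv[:, 1] >= 0) & (uv[:, 1] < h))
    return mask

With such a mask in hand, each BEV query attends to, and requests features from, only the cameras marked relevant for it, so views that cannot contribute to a query are never transmitted or processed; this is where the communication and computation savings come from.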