feat(autoware_cuda_pointcloud_preprocessor): a cuda-accelerated pointcloud preprocessor #9454

Open
wants to merge 34 commits into
base: main

Contributor

@knzo25 knzo25 commented Nov 25, 2024

Description

This PR is part of a series of PRs that aim to accelerate the Sensing/Perception pipeline through an appropriate use of CUDA.

List of PRs:

To use these branches, the following additions to the autoware.repos are necessary:

  vendor/cuda_blackboard:
    type: git
    url: git@github.com:knzo25/cuda_blackboard.git
    version: main
  vendor/negotiated:
    type: git
    url: https://github.com/osrf/negotiated.git
    version: master

Depending on your machine and how many nodes are loaded into a container, the following branch may also be required:
https://github.com/knzo25/launch_ros/tree/fix/load_composable_node
There seems to be a bug in ROS where, if you send too many service requests at once, some of them are lost, and ros_launch cannot handle that.

Related links

Parent Issue:

  • Link

How was this PR tested?

The sensing/perception pipeline was tested up to centerpoint for TIER IV's taxi using the logging simulator.
The following tests were executed on a laptop equipped with an RTX 4060 (laptop) GPU and an Intel(R) Core(TM) Ultra 7 165H (22 cores).

| Node / processing time [ms] | Current | PR |
| --- | --- | --- |
| /sensing/lidar/top/crop_box_filter_self/debug/processing_time_ms | 5.81 | N/A |
| /sensing/lidar/top/crop_box_filter_mirror/debug/processing_time_ms | 4.59 | N/A |
| /sensing/lidar/top/distortion_corrector/debug/processing_time_ms | 10.96 | N/A |
| /sensing/lidar/top/ring_outlier_filter/debug/processing_time_ms | 10.69 | N/A |
| /sensing/lidar/top/cuda_pointcloud_preprocessor/debug/processing_time_ms | N/A | 3.08 (2.03 are H->D copies) |
| /sensing/lidar/concatenate_data_synchronizer/debug/processing_time_ms | 7.83 | 0.70 |
| Total | 38.8 | 3.78 |

10.26x speedup!

Notes for reviewers

The main branch that I used for development is feat/cuda_acceleration_and_transport_layer.
However, the changes were too big, so I split them into several PRs. That being said, further development, if any, will still happen on that branch (and then be cherry-picked into the respective PRs), and the review changes will be cherry-picked into the development branch.

Interface changes

An additional topic is added to perform type negotiation:
Example: input/pointcloud -> input/pointcloud and input/pointcloud/cuda

Effects on system behavior

Enabling this preprocessing in the launchers should considerably reduce latency and CPU usage (at the cost of higher GPU usage).

…sonal repository

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
@github-actions github-actions bot added type:documentation Creating or refining documentation. (auto-assigned) component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) tag:require-cuda-build-and-test labels Nov 25, 2024

github-actions bot commented Nov 25, 2024

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

@knzo25 knzo25 self-assigned this Nov 25, 2024
Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
…pointcloud changes after the first iteration

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Contributor

@mojomex mojomex left a comment

Thank you for the amazing PR, these performance improvements are desperately needed.

I haven't checked the PR for functionality yet, but I'll leave my first round of comments here.

The main points I'd like to address are

  • memory safety and idiomatic C++ (there is currently a lot of raw-pointer code which should be avoided whenever possible)
  • modularity: currently the pipeline is hard-coded and all in one place. This makes the module hard to adapt to different projects, and makes the individual modules in the pipeline hard to maintain.

Thank you for your time!


std::size_t max_ring = 0;

for (std::size_t i = 0; i < input_pointcloud_msg_ptr->width * input_pointcloud_msg_ptr->height;
Contributor

Iterating without explicit bounds checking of the underlying array is not memory-safe. I would therefore suggest using the above-mentioned PointCloud2Iterators here.
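
A minimal, dependency-free sketch of the kind of check meant here. `CloudView`, `maxRing` and the `ring_offset` argument are hypothetical stand-ins (they are not part of this PR or of sensor_msgs), and the ring is assumed to be a 16-bit field. In the real node, `sensor_msgs::PointCloud2ConstIterator` would take the place of the manual offset arithmetic.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical stand-in for the PointCloud2 fields used in this sketch.
struct CloudView
{
  std::uint32_t width{0};
  std::uint32_t height{0};
  std::uint32_t point_step{0};     // bytes per point
  std::vector<std::uint8_t> data;  // raw point buffer
};

// Returns the largest ring index, or std::nullopt when the ring field does not
// fit inside one point or the buffer is smaller than width * height points.
std::optional<std::uint16_t> maxRing(const CloudView & cloud, std::size_t ring_offset)
{
  if (cloud.point_step < ring_offset + sizeof(std::uint16_t)) {
    return std::nullopt;  // also rejects point_step == 0
  }
  const std::size_t num_points = static_cast<std::size_t>(cloud.width) * cloud.height;
  if (cloud.data.size() / cloud.point_step < num_points) {
    return std::nullopt;  // header claims more points than the buffer holds
  }

  std::uint16_t max_ring = 0;
  for (std::size_t i = 0; i < num_points; ++i) {
    std::uint16_t ring = 0;
    // memcpy avoids unaligned or type-punned reads from the byte buffer.
    std::memcpy(&ring, cloud.data.data() + i * cloud.point_step + ring_offset, sizeof(ring));
    max_ring = std::max(max_ring, ring);
  }
  return max_ring;
}
```

The point of the sketch is that the size check happens once, before the loop, so the per-point cost stays the same as with the raw-pointer version.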

Contributor Author

Sorry, but I ended up removing the adapter since it was heavier than the preprocessing 😅
59144d8 and 5482f9c

num_rings_ = std::max(num_rings_, static_cast<std::size_t>(16));
std::vector<std::size_t> ring_points(num_rings_, 0);

for (std::size_t i = 0; i < input_pointcloud_msg_ptr->width * input_pointcloud_msg_ptr->height;
Contributor

Iterating without explicit bounds checking of the underlying array is not memory-safe. I would therefore suggest using the above-mentioned PointCloud2Iterators here.

Contributor Author

Sorry, but I ended up removing the adapter since it was heavier than the preprocessing 😅
59144d8 and 5482f9c

max_ring = std::max(max_ring, ring);
}

// Set max rings to the next power of two
Contributor

Admittedly kind of a niche problem, but not all sensors (Pandar40P) have 2^n rings.

Although auto-detecting the number of rings is nice, it has no hard guarantee to be accurate (e.g. the sensor is under a cover when turned on and there are thus no points in the cloud).

Does cuda_pointcloud_preprocessor support input pointclouds whose dimensions change across iterations (e.g. cloud 1 has 0 rings, the next one has 64 rings with 2000 points each, then 64 rings with 5000 points each)?
If not, I'd suggest making n_rings and max_points_per_ring parameters so that we can guarantee correct behavior at runtime.
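
To make the concern concrete, here is a small sketch of the rounding described in the quoted code. `nextPowerOfTwo` and `ringCapacity` are hypothetical helpers, and the `max_ring + 1` convention (ring indices starting at 0) is an assumption, not something confirmed by the PR.

```cpp
#include <cstddef>

// Smallest power of two that is >= n (returns 1 for n == 0).
// n is a ring count here, far below the range where the shift could overflow.
constexpr std::size_t nextPowerOfTwo(std::size_t n)
{
  std::size_t p = 1;
  while (p < n) {
    p <<= 1;
  }
  return p;
}

// Hypothetical helper: number of ring rows to allocate when the largest ring
// index observed in a cloud is max_ring, with a lower bound of min_rings.
constexpr std::size_t ringCapacity(std::size_t max_ring, std::size_t min_rings = 16)
{
  const std::size_t observed = max_ring + 1;  // assumes ring indices start at 0
  return nextPowerOfTwo(observed < min_rings ? min_rings : observed);
}

// A 40-ring sensor (largest index 39) is rounded up to 64 rows.
static_assert(ringCapacity(39) == 64, "40 rings round up to 64");
// An empty first cloud yields only the lower bound, regardless of the real sensor.
static_assert(ringCapacity(0) == 16, "empty cloud falls back to min_rings");
```

With auto-detection alone, the second case is the problematic one: the capacity derived from an empty or covered first scan has to grow later, which is why explicit parameters (or support for resizing between iterations) would be needed.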

Contributor Author

In the new version
59144d8 and 5482f9c

Autodetection is performed automatically with very little overhead 👍

bool CudaOrganizedPointcloudAdapterNode::orderPointcloud(
const sensor_msgs::msg::PointCloud2::ConstSharedPtr input_pointcloud_msg_ptr)
{
const autoware::point_types::PointXYZIRCAEDT * input_buffer =
Contributor

Same comment about bounds/type checking as above 🙇

Contributor Author

Sorry, but I ended up removing the adapter since it was heavier than the preprocessing 😅
59144d8 and 5482f9c

if (idx < num_points && masks[idx] == 1) {
output_points[indices[idx] - 1] = input_points[idx];
}
}
Contributor

These two functions are identical except for their argument types. Consider making one templated function instead.
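
A host-side sketch of what a single templated version could look like. It mirrors the quoted kernel body with a plain loop so that it compiles without CUDA; on the device, the loop index would be the thread index and the function would be declared as a `template <typename PointT> __global__` kernel. The name `extractPoints` and the assumption that `indices` holds the inclusive prefix sum of `masks` are mine, not taken from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies every input point whose mask is 1 to its compacted output slot.
// Preconditions (assumed): masks and indices have at least input_points.size()
// elements, indices[i] is the inclusive prefix sum of masks, and output_points
// is already sized to the number of kept points.
template <typename PointT>
void extractPoints(
  const std::vector<PointT> & input_points, const std::vector<std::uint32_t> & masks,
  const std::vector<std::uint32_t> & indices, std::vector<PointT> & output_points)
{
  const std::size_t num_points = input_points.size();
  for (std::size_t idx = 0; idx < num_points; ++idx) {
    if (masks[idx] == 1) {
      // indices[idx] counts kept points up to and including idx, hence the -1.
      output_points[indices[idx] - 1] = input_points[idx];
    }
  }
}
```

The same body then serves every point type, so the two near-identical functions collapse into one definition with two instantiations.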

Contributor Author

Unused kernels were deleted in
303b9ed

Contributor

@manato manato left a comment

@knzo25
Thank you very much for proposing a fantastic PR, and I'm sorry that the review took so long. I left some comments from the viewpoint of CUDA usage. I'd appreciate it if you could consider them.

knzo25 and others added 2 commits January 10, 2025 18:29
…oud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <[email protected]>
knzo25 and others added 14 commits January 10, 2025 18:32
…oud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <[email protected]>
…oud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Max Schmeller <[email protected]>
…oud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <[email protected]>
…oud_preprocessor/cuda_pointcloud_preprocessor.cu

Co-authored-by: Manato Hirabayashi <[email protected]>
… approach is actually faster

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Contributor

@manato manato left a comment

@knzo25
Thank you for responding to my comments. I looked at the code and confirmed that my major concern has been resolved. Once the comments from @mojomex are resolved, I believe this PR can be merged.

…y bottleneck is the H->D copy

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Contributor Author

knzo25 commented Feb 5, 2025

@mojomex
Max, I am sorry, but in 59144d8 and 5482f9c I changed a lot of things by deleting the CUDA adapter node and moving its implementation to CUDA (now the only hotspot is the host -> device copy, as you can see in the updated table).

@knzo25 knzo25 requested a review from mojomex February 5, 2025 09:34
Contributor Author

knzo25 commented Feb 5, 2025

@manato
It seems to me that the host -> device transfer is too slow. The source memory is pageable (it comes from ROS), but it still seems off to me. Do you have a way to check this? The theoretical bandwidth that I got with the CUDA samples was very different from the bandwidth reported by Nsight 😢, and there was nothing else happening on the GPU at the time.
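
One quick sanity check is to convert the measured copy time into an effective bandwidth and compare it with the pageable-memory figure from the CUDA samples. The sketch below only does that arithmetic. The 2.03 ms comes from the table in the description, while the point count and the 32-byte point size are assumptions for illustration and should be replaced with the real values.

```cpp
#include <cstddef>

// Effective throughput in GB/s (1 GB = 1e9 bytes) for a copy of `bytes`
// that took `milliseconds`.
constexpr double effectiveBandwidthGBps(std::size_t bytes, double milliseconds)
{
  return (static_cast<double>(bytes) / 1e9) / (milliseconds / 1e3);
}

// Hypothetical cloud size: both values below are assumptions, not measurements.
constexpr std::size_t kAssumedNumPoints = 200000;
constexpr std::size_t kAssumedPointSize = 32;  // bytes per point
constexpr std::size_t kAssumedBytes = kAssumedNumPoints * kAssumedPointSize;

// Measured H->D copy time reported in the PR description.
constexpr double kMeasuredCopyMs = 2.03;

constexpr double kEffectiveGBps = effectiveBandwidthGBps(kAssumedBytes, kMeasuredCopyMs);
```

If the resulting figure is far below what the samples report for pageable transfers of a similar size, the gap is more likely in how the copy is issued (for example, several small copies or an implicit synchronization) than in the bus itself.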

…ring it with the baseline

Signed-off-by: Kenzo Lobos-Tsunekawa <[email protected]>
Contributor

@amadeuszsz amadeuszsz left a comment

I added a few minor comments. Regarding runtime behavior, there was an issue with the point cloud rings (see attached image), but I believe it has already been fixed in f100224.
[Image: Screenshot from 2025-02-05 20-21-52]

Regarding the package architecture, I wonder whether keeping the CUDA kernels together with the base C++ functions in the same .cu files is the best approach. On the other hand, the kernel files already consist of the actual CUDA code plus helper launch functions. I will leave this up to you, since you have a better view of the amount of remaining code and of the potential future development path. It may not be worth doubling the number of files.

I will post my comments on the CUDA kernel code in the next review; I need more time 🙏🏻

<arg name="input/twist" default="/sensing/vehicle_velocity_converter/twist_with_covariance"/>
<arg name="output/pointcloud" default="/sensing/lidar/top/test"/>

<arg name="cuda_pointcloud_preprocessor_param_file" default="$(find-pkg-share cuda_pointcloud_preprocessor)/config/cuda_pointcloud_preprocessor.param.yaml"/>
Contributor

Package name is autoware_cuda_pointcloud_preprocessor


<arg name="cuda_pointcloud_preprocessor_param_file" default="$(find-pkg-share cuda_pointcloud_preprocessor)/config/cuda_pointcloud_preprocessor.param.yaml"/>

<node pkg="cuda_pointcloud_preprocessor" exec="cuda_pointcloud_preprocessor_node" name="cuda_pointcloud_preprocessor" output="screen">
Contributor

Package name is autoware_cuda_pointcloud_preprocessor

twist.cum_y = cum_y;
twist.cum_theta = cum_theta;

std::uint64_t twist_global_stamp_nsec = twist_stamp;
Contributor

There is a case where twist_stamp is uninitialized.
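
One way to rule this out is to make the "no value" case explicit in the type. The sketch below is a simplified illustration only: `TwistSample` and `latestStampBefore` are hypothetical names, and the queue is assumed to be sorted by ascending stamp.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical, simplified twist sample; only the stamp matters for this sketch.
struct TwistSample
{
  std::uint64_t stamp_nsec{0};
  double linear_x{0.0};
};

// Returns the stamp of the newest sample at or before query_nsec, or
// std::nullopt when the queue is empty or every sample is newer.
std::optional<std::uint64_t> latestStampBefore(
  const std::deque<TwistSample> & queue, std::uint64_t query_nsec)
{
  std::optional<std::uint64_t> stamp;  // starts empty instead of uninitialized
  for (const auto & sample : queue) {
    if (sample.stamp_nsec > query_nsec) {
      break;  // queue is assumed sorted, so later samples are newer as well
    }
    stamp = sample.stamp_nsec;
  }
  return stamp;
}
```

The caller then has to handle the empty case before assigning to `twist_global_stamp_nsec`, instead of silently reading an indeterminate value. Simply initializing `twist_stamp` at its declaration would also remove the undefined behavior, at the cost of hiding the missing-data case.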

if(NOT ${CUDA_FOUND})
message(WARNING "cuda was not found, so the autoware_cuda_pointcloud_preprocessor package will not be built.")
return()
endif()
Contributor

nit:
Consider adding

elseif(CMAKE_BUILD_TYPE STREQUAL "Debug")
  # CMAKE_CUDA_FLAGS is a space-separated string; CUDA_NVCC_FLAGS is a list
  string(APPEND CMAKE_CUDA_FLAGS " -g -G")
  list(APPEND CUDA_NVCC_FLAGS "-g" "-G")
endif()

to make cuda kernels debugging possible.

add_compile_definitions(ROS_DISTRO_GALACTIC)
elseif(${ROS_DISTRO} STREQUAL "humble")
add_compile_definitions(ROS_DISTRO_HUMBLE)
endif()
Contributor

nit:
Can we also handle the iron and jazzy distros already?


void setCropBoxParameters(const std::vector<CropBoxParameters> & crop_box_parameters);
void setRingOutlierFilterParameters(const RingOutlierFilterParameters & ring_outlier_parameters);
void set3DUndistortion(bool value);
Contributor

Function declarations and definitions usually use the same parameter names (here: value vs. use_3d_undistortion).


#include <deque>
#include <memory>
#include <vector>
Contributor

Unused header

#define AUTOWARE__CUDA_POINTCLOUD_PREPROCESSOR__OUTLIER_KERNELS_HPP_

#include "autoware/cuda_pointcloud_preprocessor/point_types.hpp"
#include "autoware/cuda_pointcloud_preprocessor/types.hpp"
Contributor

Unused header

#define AUTOWARE__CUDA_POINTCLOUD_PREPROCESSOR__ORGANIZE_KERNELS_HPP_

#include "autoware/cuda_pointcloud_preprocessor/point_types.hpp"
#include "autoware/cuda_pointcloud_preprocessor/types.hpp"
Contributor

Unused header

Labels
component:sensing Data acquisition from sensors, drivers, preprocessing. (auto-assigned) tag:require-cuda-build-and-test type:documentation Creating or refining documentation. (auto-assigned)
Projects
Status: In Progress
4 participants