Feature: Add basic NCCL communication space backend #128
Draft: dssgabriel wants to merge 37 commits into kokkos:develop from dssgabriel:feature/nccl-backend
Changes from 29 commits
Commits
838f74f feat: add NCCL communication space (dssgabriel)
710a488 feat: add NCCL datatypes (dssgabriel)
59f9259 feat: add NCCL handle & request specializations (dssgabriel)
f815922 feat: add NCCL `send`/`recv` functions (dssgabriel)
e84fda8 feat: add NCCL `reduce` (dssgabriel)
37eefe3 feat: add NCCL files to CMake (dssgabriel)
5758e42 feat(nccl): correctly find and link NCCL (dssgabriel)
6b13a6c hotfix(nccl): make it so that MPI must be enabled to use NCCL (dssgabriel)
32ae23b feat(nccl): add class members (dssgabriel)
4ca9e13 fix(nccl): fix type name for `ncclDataType_t` and add missing u8 conv… (dssgabriel)
c99bb49 feat(nccl): add explicit ctors for NCCL reqs (dssgabriel)
3d3a96d feat: add high-level reduction operators (dssgabriel)
03b6d61 feat(nccl): add reduction operator conversion (dssgabriel)
b36aa46 feat(nccl): update NCCL packer to use `KokkosComm_contiguous` (dssgabriel)
17abec6 fix(nccl): fix P2P & reduce NCCL functions (dssgabriel)
30563af feat(nccl): add `allgather` support (dssgabriel)
100861e feat(nccl): add high-level KokkosComm functions using NCCL backend (dssgabriel)
5fa71bd chore: ignore clang cache and compile commands (dssgabriel)
788be5b feat: add `constexpr` to traits everywhere possible (dssgabriel)
c40247e chore: format (dssgabriel)
856b9fa fix(nccl): move `Nccl` members to `Handle<Nccl>` specialization (dssgabriel)
3b52ff9 refactor(nccl): rename `get_inner` as `comm` (dssgabriel)
b7bd4b1 refactor: using a concept for defining reduction operators (dssgabriel)
a1d2201 refactor(nccl)!: enabling NCCL also forward-declares MPI (dssgabriel)
c1fa898 refactor: use template specializations for `Send`/`Recv` w/ NCCL (dssgabriel)
93dbbde feat: add `reduce` & `allgather` available as experimental functions (dssgabriel)
9890159 refactor(nccl): use template specialization for `reduce`/`allgather` (dssgabriel)
1e73707 chore: format (dssgabriel)
eb1e6b1 fix(cmake): add missing files to target sources (dssgabriel)
9c185af fix(nccl): correctly call collective implementations with `execute` (dssgabriel)
267077a feat(nccl): add collective functions not needing `Handle` object (dssgabriel)
11789c7 Add a NCCL smoketest (cwpearson)
f3ba011 tests: restrict MPI-specific stuff to when MPI is enabled (cwpearson)
9ba1666 add KOKKOSCOMM_ENABLE_NCCL to config (cwpearson)
c42dfb4 Always compile general unit tests (cwpearson)
de95394 nccl: fix is_communication_space scoping issue (cwpearson)
6b25163 nccl: ncclDatatype_t -> ncclDataType_t (cwpearson)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,4 +5,6 @@ Testing | |
kokkos | ||
docs/_build | ||
.python-version | ||
.venv | ||
.venv | ||
.cache | ||
compile_commands.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
# Find the NCCL libraries. | ||
# Copied from pytorch: https://github.com/pytorch/pytorch/blob/main/cmake/Modules/FindNCCL.cmake | ||
# | ||
# The following variables are optionally searched for defaults | ||
# NCCL_ROOT: Base directory where all NCCL components are found | ||
# NCCL_INCLUDE_DIR: Directory where NCCL header is found | ||
# NCCL_LIB_DIR: Directory where NCCL library is found | ||
# | ||
# The following are set after configuration is done: | ||
# NCCL_FOUND | ||
# NCCL_INCLUDE_DIRS | ||
# NCCL_LIBRARIES | ||
# | ||
# The path hints include CUDA_TOOLKIT_ROOT_DIR seeing as some folks | ||
# install NCCL in the same location as the CUDA toolkit. | ||
# See https://github.com/caffe2/caffe2/issues/1601 | ||
|
||
set(NCCL_INCLUDE_DIR $ENV{NCCL_INCLUDE_DIR} CACHE PATH "Folder contains NVIDIA NCCL headers") | ||
set(NCCL_LIB_DIR $ENV{NCCL_LIB_DIR} CACHE PATH "Folder contains NVIDIA NCCL libraries") | ||
set(NCCL_VERSION $ENV{NCCL_VERSION} CACHE STRING "Version of NCCL to build with") | ||
|
||
if($ENV{NCCL_ROOT_DIR}) | ||
message(WARNING "NCCL_ROOT_DIR is deprecated. Please set NCCL_ROOT instead.") | ||
endif() | ||
list(APPEND NCCL_ROOT $ENV{NCCL_ROOT_DIR} ${CUDA_TOOLKIT_ROOT_DIR}) | ||
# Compatible layer for CMake <3.12. NCCL_ROOT will be accounted in for searching paths and libraries for CMake >=3.12. | ||
list(APPEND CMAKE_PREFIX_PATH ${NCCL_ROOT}) | ||
|
||
find_path(NCCL_INCLUDE_DIRS NAMES nccl.h HINTS ${NCCL_INCLUDE_DIR}) | ||
|
||
if(USE_STATIC_NCCL) | ||
message(STATUS "USE_STATIC_NCCL is set. Linking with static NCCL library.") | ||
set(NCCL_LIBNAME "nccl_static") | ||
if(NCCL_VERSION) # Prefer the versioned library if a specific NCCL version is specified | ||
set(CMAKE_FIND_LIBRARY_SUFFIXES ".a.${NCCL_VERSION}" ${CMAKE_FIND_LIBRARY_SUFFIXES}) | ||
endif() | ||
else() | ||
set(NCCL_LIBNAME "nccl") | ||
if(NCCL_VERSION) # Prefer the versioned library if a specific NCCL version is specified | ||
set(CMAKE_FIND_LIBRARY_SUFFIXES ".so.${NCCL_VERSION}" ${CMAKE_FIND_LIBRARY_SUFFIXES}) | ||
endif() | ||
endif() | ||
|
||
find_library(NCCL_LIBRARIES NAMES ${NCCL_LIBNAME} HINTS ${NCCL_LIB_DIR}) | ||
|
||
include(FindPackageHandleStandardArgs) | ||
find_package_handle_standard_args(NCCL DEFAULT_MSG NCCL_INCLUDE_DIRS NCCL_LIBRARIES) | ||
|
||
if(NCCL_FOUND) # obtaining NCCL version and some sanity checks | ||
set(NCCL_HEADER_FILE "${NCCL_INCLUDE_DIRS}/nccl.h") | ||
message(STATUS "Determining NCCL version from ${NCCL_HEADER_FILE}...") | ||
set(OLD_CMAKE_REQUIRED_INCLUDES ${CMAKE_REQUIRED_INCLUDES}) | ||
list(APPEND CMAKE_REQUIRED_INCLUDES ${NCCL_INCLUDE_DIRS}) | ||
include(CheckCXXSymbolExists) | ||
check_cxx_symbol_exists(NCCL_VERSION_CODE nccl.h NCCL_VERSION_DEFINED) | ||
|
||
if(NCCL_VERSION_DEFINED) | ||
set(file "${PROJECT_BINARY_DIR}/detect_nccl_version.cc") | ||
file( | ||
WRITE | ||
${file} | ||
" | ||
#include <iostream> | ||
#include <nccl.h> | ||
int main() | ||
{ | ||
std::cout << NCCL_MAJOR << '.' << NCCL_MINOR << '.' << NCCL_PATCH << std::endl; | ||
|
||
int x; | ||
ncclGetVersion(&x); | ||
return x == NCCL_VERSION_CODE; | ||
} | ||
" | ||
) | ||
try_run( | ||
NCCL_VERSION_MATCHED | ||
compile_result | ||
${PROJECT_BINARY_DIR} | ||
${file} | ||
RUN_OUTPUT_VARIABLE NCCL_VERSION_FROM_HEADER | ||
CMAKE_FLAGS "-DINCLUDE_DIRECTORIES=${NCCL_INCLUDE_DIRS}" | ||
LINK_LIBRARIES ${NCCL_LIBRARIES} | ||
) | ||
if(NOT NCCL_VERSION_MATCHED) | ||
message( | ||
FATAL_ERROR | ||
"Found NCCL header version and library version do not match! \ | ||
(include: ${NCCL_INCLUDE_DIRS}, library: ${NCCL_LIBRARIES}) Please set NCCL_INCLUDE_DIR and NCCL_LIB_DIR manually." | ||
) | ||
endif() | ||
message(STATUS "NCCL version: ${NCCL_VERSION_FROM_HEADER}") | ||
else() | ||
message(STATUS "NCCL version < 2.3.5-5") | ||
endif() | ||
set(CMAKE_REQUIRED_INCLUDES ${OLD_CMAKE_REQUIRED_INCLUDES}) | ||
|
||
message(STATUS "Found NCCL (include: ${NCCL_INCLUDE_DIRS}, library: ${NCCL_LIBRARIES})") | ||
mark_as_advanced(NCCL_ROOT_DIR NCCL_INCLUDE_DIRS NCCL_LIBRARIES) | ||
endif() |
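
The version check in the module above compares the compile-time `NCCL_VERSION_CODE` from the header with the value `ncclGetVersion` reports at run time. It returns 1 on a match, so that `try_run` stores a truthy exit code in `NCCL_VERSION_MATCHED`. The following is a minimal standalone C++ sketch of that logic, written without NCCL so it compiles anywhere. `encode_version` and `version_check_exit_code` are hypothetical names, and the encoding formula is an assumption based on the `NCCL_VERSION` macro found in recent `nccl.h` headers; verify it against the header you actually build with.

```cpp
#include <cassert>

// Hypothetical stand-in for the NCCL_VERSION(X, Y, Z) macro from nccl.h.
// Assumption: versions up to 2.8 encode as major*1000 + minor*100 + patch,
// later versions as major*10000 + minor*100 + patch.
constexpr int encode_version(int major, int minor, int patch) {
  return (major <= 2 && minor <= 8) ? major * 1000 + minor * 100 + patch
                                    : major * 10000 + minor * 100 + patch;
}

// Mirrors `return x == NCCL_VERSION_CODE;` in the generated test program:
// the exit code is 1 when header and library agree, 0 otherwise.
constexpr int version_check_exit_code(int header_code, int runtime_code) {
  return runtime_code == header_code ? 1 : 0;
}
```

Note that this inverts the usual process convention where 0 means success: the CMake side tests `if(NOT NCCL_VERSION_MATCHED)`, so an exit code of 0 is what triggers the fatal error.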
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,16 +16,21 @@ | |
|
||
#pragma once | ||
|
||
#include <vector> | ||
|
||
#include "concepts.hpp" | ||
#include <KokkosComm/concepts.hpp> | ||
#include <KokkosComm/config.hpp> | ||
#include <KokkosComm/reduction_op.hpp> | ||
|
||
namespace KokkosComm { | ||
|
||
#if defined(KOKKOSCOMM_ENABLE_MPI) | ||
class Mpi; | ||
using DefaultCommunicationSpace = Mpi; | ||
using FallbackCommunicationSpace = Mpi; | ||
#elif defined(KOKKOSCOMM_ENABLE_NCCL) | ||
class Mpi; | ||
Review comment: I think I should depend on MPI in this case.
|
||
class Nccl; | ||
using DefaultCommunicationSpace = Nccl; | ||
using FallbackCommunicationSpace = Mpi; | ||
#else | ||
#error at least one transport must be defined | ||
#endif | ||
|
@@ -42,13 +47,29 @@ namespace Impl { | |
template <KokkosView RecvView, KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace, | ||
CommunicationSpace CommSpace = DefaultCommunicationSpace> | ||
struct Recv; | ||
|
||
template <KokkosView SendView, KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace, | ||
CommunicationSpace CommSpace = DefaultCommunicationSpace> | ||
struct Send; | ||
|
||
template <KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace, | ||
CommunicationSpace CommSpace = DefaultCommunicationSpace> | ||
struct Barrier; | ||
|
||
} // namespace Impl | ||
|
||
} // namespace KokkosComm | ||
// Allgather and Reduce are currently experimental functions | ||
namespace Experimental::Impl { | ||
|
||
template <KokkosView SendView, KokkosView RecvView, KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace, | ||
CommunicationSpace CommSpace = DefaultCommunicationSpace> | ||
struct AllGather; | ||
|
||
template <KokkosView SendView, KokkosView RecvView, ReductionOperator RedOp, | ||
KokkosExecutionSpace ExecSpace = Kokkos::DefaultExecutionSpace, | ||
CommunicationSpace CommSpace = DefaultCommunicationSpace> | ||
struct Reduce; | ||
|
||
} // namespace Experimental::Impl | ||
|
||
} // namespace KokkosComm |
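
The header above combines two mechanisms: preprocessor flags choose which forward-declared class becomes the default and fallback communication space, and each operation is a declared-only primary template that a backend specializes. The sketch below shows both in isolation. It is not the PR's code: the `sketch` namespace, the `SKETCH_ENABLE_*` flags, and the `uses_nccl` member are hypothetical, and plain `typename` parameters replace the `KokkosView` and `CommunicationSpace` concepts so the example compiles as C++17 without Kokkos.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical build flag; in the PR the real flag comes from the build system.
#define SKETCH_ENABLE_NCCL

namespace sketch {

#if defined(SKETCH_ENABLE_MPI)
class Mpi;
using DefaultCommunicationSpace  = Mpi;
using FallbackCommunicationSpace = Mpi;
#elif defined(SKETCH_ENABLE_NCCL)
class Mpi;  // forward-declared so it can still be named as the fallback
class Nccl;
using DefaultCommunicationSpace  = Nccl;
using FallbackCommunicationSpace = Mpi;
#else
#error at least one transport must be defined
#endif

namespace Impl {

// Primary template: declared only, so using an unsupported backend
// fails at compile time with an incomplete-type error.
template <typename SendView, typename CommSpace = DefaultCommunicationSpace>
struct Send;

// Backend-specific partial specialization (hypothetical body).
template <typename SendView>
struct Send<SendView, Nccl> {
  static constexpr bool uses_nccl = true;
};

}  // namespace Impl

}  // namespace sketch
```

With the NCCL flag defined, `sketch::Impl::Send<int>` resolves to the `Nccl` specialization through the default template argument, while `FallbackCommunicationSpace` still names `Mpi`.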
Review comment: I do not exactly understand how it will work. Don't we need to call it `execute()`?
Reply: You are right, I fixed this in 9c185af (#128).
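
The exchange above concerns how a high-level function reaches its backend implementation. The code under discussion is not shown here, so the following is only a sketch under an assumption: that the implementation structs expose their work through a static `execute` member, and that the wrapper must call it explicitly. All names (`sketch`, `Nccl`, `Sum`, `reduce`) are hypothetical, and the "reduction" merely combines local values; no communication takes place.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

struct Nccl {};  // hypothetical communication-space tag

struct Sum {  // hypothetical reduction operator
  static int apply(int a, int b) { return a + b; }
};

namespace Impl {

template <typename RedOp, typename CommSpace>
struct Reduce;  // primary template, declared only

template <typename RedOp>
struct Reduce<RedOp, Nccl> {
  // The work lives in a static member function, so the caller has to
  // invoke it by name; naming the type alone does nothing.
  static int execute(const std::vector<int>& contributions) {
    int acc = contributions.empty() ? 0 : contributions.front();
    for (std::size_t i = 1; i < contributions.size(); ++i) {
      acc = RedOp::apply(acc, contributions[i]);
    }
    return acc;
  }
};

}  // namespace Impl

// High-level entry point: forwards to the selected implementation's execute().
template <typename RedOp, typename CommSpace = Nccl>
int reduce(const std::vector<int>& contributions) {
  return Impl::Reduce<RedOp, CommSpace>::execute(contributions);
}

}  // namespace sketch
```

Calling `sketch::reduce<sketch::Sum>({1, 2, 3, 4})` dispatches through the `Nccl` specialization and returns the combined value.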