Releases: openucx/ucx

v1.18.0

21 Jan 09:39
693d028

1.18.0 (January 17, 2025)

Features:

UCP

  • Enabled using CUDA staging buffers for pipeline protocols by default
  • Added endpoint reconfiguration support for non-reused p2p scenarios
  • Enabled non-cacheable memory domains, activated for gdr_copy
  • Added user_data parameter to ucp_ep_query
  • Added support for host memory pipeline through CUDA buffers for rendezvous protocol
  • Added global VA infrastructure and memory region in absence of error handling
  • Made protocol performance node names more informative
  • Enforced always running on the same thread in single thread mode
  • Multiple improvements in protocols selection infrastructure
  • Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
  • Allowed up to 64 endpoint lanes for systems with many transports or devices
  • Added usage tracker to worker
  • Improved various logging messages

RDMA CORE (IB, ROCE, etc.)

  • Added environment variable to manage DC initiator capacity
  • Added DC dcs_hybrid policy
  • Reduced MLX5/DV stack size consumption
  • Added ODP support for verbs and mlx5dv
  • Added support of CUDA managed memory on IB when ODP is available
  • Added support of Adaptive Routing on RoCE
  • Enabled use of implicit ODP with relaxed ordering
  • Improved GPU-Direct detection in IB transport
  • Increased DC initiator default count to 32 for performance optimization
  • Added ConnectX-8 device support with DDP
  • Added support for subnet filter list for RoCE interfaces
  • Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
  • Added IB MLX5 as a separate UCX module with separate RPM sub-package
  • Added initial support for GGA transport, for fast DPU memory access
  • Set IB DevX atomic mode based on device capabilities
  • Removed DC keepalive mechanism, since the keepalive is done on UCP layer
  • Optimized cross-gVMI memory registration using indirect memory keys cache
  • Improved various logging messages
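
The notes above do not describe the syntax of the RoCE subnet filter list, so the following is only a minimal sketch of what matching an interface address against such a list involves. The function name `interface_allowed`, the CIDR notation, and the "empty list allows everything" rule are assumptions made for illustration, not UCX behavior.

```python
import ipaddress


def interface_allowed(addr, subnets):
    """Return True if addr falls inside any subnet of the filter list.

    An empty filter list is treated as "allow everything"; that rule is an
    assumption of this sketch, not documented UCX behavior.
    """
    if not subnets:
        return True
    ip = ipaddress.ip_address(addr)
    for text in subnets:
        net = ipaddress.ip_network(text, strict=False)
        # Skip subnets of the other IP version, then test membership.
        if ip.version == net.version and ip in net:
            return True
    return False


if __name__ == "__main__":
    filters = ["192.168.1.0/24", "fd00::/8"]
    print(interface_allowed("192.168.1.17", filters))  # inside the first subnet
    print(interface_allowed("10.0.0.5", filters))      # outside both subnets
```

Mixing IPv4 and IPv6 entries in one list is handled by comparing IP versions before the membership test.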

CUDA

  • Added multi-node NVLink support
  • Added CUDA Fabric memory support with detection and allocation
  • Improved gdr_copy latency estimations on AMD Milan systems
  • Added check for gdr_copy runtime/build version mismatch
  • Added handling of missing IPC capability when unpacking keys
  • Added caching for CUDA IPC memory pool import operation
  • Added gdr_copy variables to optimize performance on Grace Hopper systems
  • Improved CUDA IPC concurrency for a larger count of reachable peers

UCS

  • Added support for wildcards in configuration parameter names
  • Added ASAN protection to several internal data structures
  • Reduced stack usage in topology detection code
  • Improved bitmaps configuration parsing with wider bitfield
  • Added options to set topology distance between devices
  • Optimized VFS unix socket watch by using user private folder
  • Added general IP subnet matching infrastructure
  • Extended array data structure to support user-provided array copy routine
  • Improved time units description
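
To make the wildcard item above concrete, here is a small sketch of how a pattern in a parameter name could expand to several concrete parameters. It assumes shell-style globbing via Python's `fnmatch` and a "later entry wins" rule; both are assumptions, and the helper `apply_settings` is hypothetical. The parameter names are ones mentioned elsewhere in these notes.

```python
from fnmatch import fnmatchcase


def apply_settings(known_params, settings):
    """Expand settings whose names may contain shell-style wildcards.

    known_params: list of full parameter names.
    settings: dict mapping a name or pattern to a value. Entries are applied
              in insertion order, so a later entry overrides an earlier one.
    """
    resolved = {}
    for pattern, value in settings.items():
        for name in known_params:
            if fnmatchcase(name, pattern):
                resolved[name] = value
    return resolved


if __name__ == "__main__":
    params = ["UCX_IB_SL", "UCX_REVERSE_SL", "UCX_TLS"]
    # The pattern sets both *_SL parameters; the exact name then overrides one.
    print(apply_settings(params, {"UCX_*_SL": "1", "UCX_IB_SL": "3"}))
```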

UCM

  • Extended CUDA memory hooks to include memory mapping APIs

Tools

  • Improved performance by increasing window size for put_bw and adding get_bw in ucx_perftest
  • Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
  • Improved ucx_perftest uni-directional test with added fence
  • Expanded the batch section of the ucx_perftest command-line documentation

Documentation

  • Added a section regarding adaptive routing on RoCE

Architecture

  • Added CPU Model for MI300A
  • Added Fujitsu ARM specific values to ucx.conf
  • Added AMD Turin support
  • Added an optimized non-temporal memory copy implementation for AMD CPU

Build

  • Improved compiler error reporting with added flag
  • Improved coverity script to allow faster turnaround time
  • Improved Intel Compiler detection and support

GO

  • Added multi-send flag and user memh support in request params

Packaging

  • Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments

Bugfixes:

UCP

  • Fixed stack overflow in exported rkey unpack
  • Removed extra remote-cpu overhead from protocol estimation for zcopy
  • Fixed performance estimation for rndv pipeline protocols
  • Fixed ATP sending by picking the correct lane
  • Fixed missing reg_id on memh creation
  • Fixed repeated invalidations by retaining existing access flags
  • Fixed abort reason propagation for rendezvous RTR mtype
  • Do not check transport availability if it is disabled by UCX_TLS environment variable
  • Fixed wrong flag being used for checking BCOPY capability
  • Fixed sending too many ATPs for small messages
  • Enforced 16-bit size for Active Message identifiers
  • Fixed unnecessary status check for emulated AMO
  • Fixed more than one fragment sending in rendezvous pipeline
  • Fixed crash by using biggest max frag across all lanes
  • Fixed missing memory handle flags by copying from parent to child
  • Fixed worker interface activate count
  • Fixed flush requests by replacing ATP/flush lane map with lane indexes
  • Fixed lost uct_flags when merging memory regions
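
For context on the UCX_TLS item above: UCX_TLS takes a comma-separated list of transport names, and a leading `^` turns the list into an exclusion list. The sketch below shows, in simplified form, how disabled transports can be filtered out before any availability check runs. It ignores transport aliases, and the helper names are hypothetical rather than UCX internals.

```python
def transport_enabled(tls_value, transport):
    """Return True if transport is enabled by a UCX_TLS-style value.

    Simplified: aliases are not expanded. A leading '^' means "everything
    except the listed transports"; 'all' enables every transport.
    """
    value = tls_value.strip()
    if value.startswith("^"):
        denied = {name for name in value[1:].split(",") if name}
        return transport not in denied
    allowed = {name for name in value.split(",") if name}
    return "all" in allowed or transport in allowed


def transports_to_probe(tls_value, candidates):
    """Keep only the candidates worth checking for availability."""
    return [t for t in candidates if transport_enabled(tls_value, t)]


if __name__ == "__main__":
    candidates = ["tcp", "rc_mlx5", "cuda_copy"]
    print(transports_to_probe("^tcp", candidates))
    print(transports_to_probe("rc_mlx5,cuda_copy", candidates))
```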

UCT

  • Fixed memory domain UCT flags description

RDMA CORE (IB, ROCE, etc.)

  • Fixed FETCH_ADD remote access error for ODP/KSM case
  • Fixed missing conditional compilation checks for DM
  • Fixed IB MD allocation naming typo
  • Fixed invalid GIDs filter in IB
  • Fixed flags usage in MLX5 zcopy_post
  • Do not limit ODP registration retries
  • Fixed JUCX failures by considering the number of supported completion vectors

CUDA

  • Fixed async memory handling using CUDA memory type on Grace
  • Added rcache overhead in performance estimation
  • Fixed gdr_copy performance regression by providing maximum estimation between get and put
  • Fixed CUDA IPC reachability check
  • Fixed crash in MPI_Finalize when CUDA context is destroyed
  • Always require rcache by default for gdr_copy
  • Fixed crash in gdr_copy cleanup when registration cache is disabled
  • Fixed CUDA copy memory domain allocations
  • Fixed multiple tests for gdr_copy transport
  • Fixed race condition in CUDA IPC peer accessible cache

UCS

  • Fixed a crash by using heap allocation to process expired timers in batch
  • Fixed allocation issue on memtrack dump
  • Fixed deletion of the monitored folder in VFS
  • Fixed unsafe resize for DC initiator array
  • Fixed function macro invocation to match C standard
  • Fixed calling async handler on already released resource
  • Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
  • Fixed undeclared value error in timer conversion routine
  • Fixed uninitialized value access in registration cache

UCM

  • Fixed race condition in parsing proc maps
  • Fixed mremap failure while parsing /proc/self/maps

ROCM

  • Fixed ROCM interface reachability test
  • Fixed memory domain fork test

TCP

  • Always bind endpoint to interface

Tools

  • Fixed buffer size potential overflow in ucx_perftest
  • Fixed missing address when packing memory keys on ucx_perftest
  • Fixed memory leak for endpoint report in ucx_info
  • Fixed build without openmp in ucx_perftest
  • Fixed UCT device override on server side on ucx_perftest

Build

  • Fixed tests to run with the correct ASAN version

Configuration

  • Used POSIX Bourne shell syntax to check equality
  • Fixed build failure by using proper flags in compiler.m4
  • Fixed perftest MAD support default guessing

GO

  • Added serialized thread mode to avoid subtle races between threads
  • Fixed make distcheck

v1.18.0 RC3

23 Dec 17:06
9ce35d0
Pre-release

1.18.0-rc3 (December 23, 2024)

Features:

UCP

  • Enabled using CUDA staging buffers for pipeline protocols by default
  • Added endpoint reconfiguration support for non-reused p2p scenarios
  • Enabled non-cacheable memory domains, activated for gdr_copy
  • Added user_data parameter to ucp_ep_query
  • Added support for host memory pipeline through CUDA buffers for rendezvous protocol
  • Added global VA infrastructure and memory region in absence of error handling
  • Made protocol performance node names more informative
  • Enforced always running on the same thread in single thread mode
  • Multiple improvements in protocols selection infrastructure
  • Added UCP_MEM_MAP_LOCK API flag to enforce locked memory mapping
  • Allowed up to 64 endpoint lanes for systems with many transports or devices
  • Added usage tracker to worker
  • Improved various logging messages

RDMA CORE (IB, ROCE, etc.)

  • Added environment variable to manage DC initiator capacity
  • Added DC dcs_hybrid policy
  • Reduced MLX5/DV stack size consumption
  • Added ODP support for verbs and mlx5dv
  • Added support of CUDA managed memory on IB when ODP is available
  • Added support of Adaptive Routing on RoCE
  • Enabled use of implicit ODP with relaxed ordering
  • Improved GPU-Direct detection in IB transport
  • Increased DC initiator default count to 32 for performance optimization
  • Added ConnectX-8 device support with DDP
  • Added support for subnet filter list for RoCE interfaces
  • Enhanced the error message to provide more details when a connection cannot be established due to unreachable transports
  • Added IB MLX5 as a separate UCX module with separate RPM sub-package
  • Added initial support for GGA transport, for fast DPU memory access
  • Set IB DevX atomic mode based on device capabilities
  • Removed DC keepalive mechanism, since the keepalive is done on UCP layer
  • Optimized cross-gVMI memory registration using indirect memory keys cache
  • Improved various logging messages

CUDA

  • Added multi-node NVLink support
  • Added CUDA Fabric memory support with detection and allocation
  • Improved gdr_copy latency estimations on AMD Milan systems
  • Added check for gdr_copy runtime/build version mismatch
  • Added handling of missing IPC capability when unpacking keys
  • Added caching for CUDA IPC memory pool import operation
  • Added gdr_copy variables to optimize performance on Grace Hopper systems
  • Improved CUDA IPC concurrency for a larger count of reachable peers

UCS

  • Added support for wildcards in configuration parameter names
  • Added ASAN protection to several internal data structures
  • Reduced stack usage in topology detection code
  • Improved bitmaps configuration parsing with wider bitfield
  • Added options to set topology distance between devices
  • Optimized VFS unix socket watch by using user private folder
  • Added general IP subnet matching infrastructure
  • Extended array data structure to support user-provided array copy routine
  • Improved time units description

UCM

  • Extended CUDA memory hooks to include memory mapping APIs

Tools

  • Improved performance by increasing window size for put_bw and adding get_bw in ucx_perftest
  • Added multi-send flag for receive operations in bandwidth benchmarks in ucx_perftest
  • Improved ucx_perftest uni-directional test with added fence
  • Expanded the batch section of the ucx_perftest command-line documentation

Documentation

  • Added a section regarding adaptive routing on RoCE

Architecture

  • Added CPU Model for MI300A
  • Added Fujitsu ARM specific values to ucx.conf
  • Added AMD Turin support
  • Added an optimized non-temporal memory copy implementation for AMD CPU

Build

  • Improved compiler error reporting with added flag
  • Improved coverity script to allow faster turnaround time
  • Improved Intel Compiler detection and support

GO

  • Added multi-send flag and user memh support in request params

Packaging

  • Improved dpkg-buildpackage sample command by explicitly adding mlx5 related arguments

Bugfixes:

UCP

  • Fixed stack overflow in exported rkey unpack
  • Removed extra remote-cpu overhead from protocol estimation for zcopy
  • Fixed performance estimation for rndv pipeline protocols
  • Fixed ATP sending by picking the correct lane
  • Fixed missing reg_id on memh creation
  • Fixed repeated invalidations by retaining existing access flags
  • Fixed abort reason propagation for rendezvous RTR mtype
  • Do not check transport availability if it is disabled by UCX_TLS environment variable
  • Fixed wrong flag being used for checking BCOPY capability
  • Fixed sending too many ATPs for small messages
  • Enforced 16-bit size for Active Message identifiers
  • Fixed unnecessary status check for emulated AMO
  • Fixed more than one fragment sending in rendezvous pipeline
  • Fixed crash by using biggest max frag across all lanes
  • Fixed missing memory handle flags by copying from parent to child
  • Fixed worker interface activate count
  • Fixed flush requests by replacing ATP/flush lane map with lane indexes
  • Fixed lost uct_flags when merging memory regions

UCT

  • Fixed memory domain UCT flags description

RDMA CORE (IB, ROCE, etc.)

  • Fixed FETCH_ADD remote access error for ODP/KSM case
  • Fixed missing conditional compilation checks for DM
  • Fixed IB MD allocation naming typo
  • Fixed invalid GIDs filter in IB
  • Fixed flags usage in MLX5 zcopy_post
  • Do not limit ODP registration retries
  • Fixed JUCX failures by considering the number of supported completion vectors

CUDA

  • Fixed async memory handling using CUDA memory type on Grace
  • Added rcache overhead in performance estimation
  • Fixed gdr_copy performance regression by providing maximum estimation between get and put
  • Fixed CUDA IPC reachability check
  • Fixed crash in MPI_Finalize when CUDA context is destroyed
  • Always require rcache by default for gdr_copy
  • Fixed crash in gdr_copy cleanup when registration cache is disabled
  • Fixed CUDA copy memory domain allocations
  • Fixed multiple tests for gdr_copy transport
  • Fixed race condition in CUDA IPC peer accessible cache

UCS

  • Fixed a crash by using heap allocation to process expired timers in batch
  • Fixed allocation issue on memtrack dump
  • Fixed deletion of the monitored folder in VFS
  • Fixed unsafe resize for DC initiator array
  • Fixed function macro invocation to match C standard
  • Fixed calling async handler on already released resource
  • Fixed performance by setting higher bandwidth for different NUMA nodes on Grace
  • Fixed undeclared value error in timer conversion routine
  • Fixed uninitialized value access in registration cache

UCM

  • Fixed race condition in parsing proc maps
  • Fixed mremap failure while parsing /proc/self/maps

ROCM

  • Fixed ROCM interface reachability test
  • Fixed memory domain fork test

TCP

  • Always bind endpoint to interface

Tools

  • Fixed buffer size potential overflow in ucx_perftest
  • Fixed missing address when packing memory keys on ucx_perftest
  • Fixed memory leak for endpoint report in ucx_info
  • Fixed build without openmp in ucx_perftest
  • Fixed UCT device override on server side on ucx_perftest

Build

  • Fixed tests to run with the correct ASAN version

Configuration

  • Used POSIX Bourne shell syntax to check equality
  • Fixed build failure by using proper flags in compiler.m4
  • Fixed perftest MAD support default guessing

GO

  • Added serialized thread mode to avoid subtle races between threads
  • Fixed make distcheck

v1.18.0 RC2

10 Dec 16:40
e992f1b
Pre-release

1.18.0-rc2 (December 10, 2024)

Features: TBD

Bugfixes: TBD

v1.18.0 RC1

26 Nov 13:25
a0fb15f
Pre-release

1.18.0-rc1 (November 26, 2024)

Features: TBD

Bugfixes: TBD

v1.17.0

13 Jun 15:35
4ef9a09

1.17.0 (June 13, 2024)

Features:

UCP

  • Improved the accuracy of rendezvous protocol performance estimation
  • Enabled short protocol for non-host memory types on empty messages
  • Improved the accuracy of performance estimation for empty messages by removing non-relevant overheads
  • Added RMA_ZCOPY_MAX_SEG_SIZE configuration parameter to allow modifying segment size for RMA-ZCOPY protocols
  • Added support for separate intra/inter-node rendezvous thresholds
  • Added support for minimal fragment size in rendezvous protocol
  • Added support for resetting request during send operation
  • Added UCX_PROTO_OVERHEAD configuration variable to allow setting protocol overheads
  • Improved performance for combined Active Message/RMA scenarios by separating them to different lanes
  • Added support for device staging buffers in pipeline protocols
  • Enabled on-demand paging for Nvidia's Grace platforms by default

RDMA CORE (IB, ROCE, etc.)

  • Introduced the UCX_REVERSE_SL environment variable to configure reverse SL for DC transport. By default, it uses UCX_IB_SL.
  • Added support for GID auto-detection in Floating LID based routing
  • Added support for multithreading KSM registration of unaligned buffers
  • Added IB_SEND_OVERHEAD and MM_[SEND|RECV]_OVERHEAD configuration variables

GPU (CUDA, ROCM)

  • Added support for oneAPI Level-Zero library for Intel GPUs

UCS

  • Added support for rcache dynamic region alignment
  • Added dynamic bitmap data structure
  • Added support for advanced key-value parsing for UCX configuration
  • Added piecewise linear function data structure
  • Added support for allocating dynamic arrays on stack

Tools

  • Added support for device memory allocation in UCX perftest
  • Added a script to use for squashing commits after PR approval
  • Added support for DPU cross-gvmi daemon in UCX perftest

Java

  • Added support for EP local socket address API in JUCX

Build

  • Added address sanitizer support
  • Added a helper shell script to run static checks

AZP

  • Replaced Valgrind tests with address sanitizer tool
  • Added Ubuntu 22.04 docker image testing

Configuration

  • Added support for filtering configuration sections by platform type
  • Added configuration file with section for Grace Hopper

Bugfixes:

UCP

  • Fixed crash due to incorrect lane selection when active message is disabled
  • Fixed RMA lane selection issue due to wrong bandwidth calculation
  • Fixed rendezvous protocol information in protocol details table
  • Fixed endpoint reconfiguration issue due to wrong bandwidth calculation
  • Fixed Active Message handlers issue due to out of order registration
  • Fixed registration of memh evens for imported memory key
  • Fixed sockaddr unreachable destination error handling
  • Fixed uninitialized memory issue in new protocols infrastructure
  • Fixed race condition when using strong fence by flushing all endpoints
  • Fixed incorrect RMA message size on immediate completion with no datatype
  • Fixed incorrect performance estimation due to fp8 pack/unpack issue
  • Fixed remote access error when rcache memory is not registered with atomic access
  • Fixed assertion failure when rcache fails during memh allocation
  • Fixed atomic device selection issue
  • Fixed worker interface deactivation while still in use by endpoints
  • Fixed wire compatibility issue due to mismatched lane selection

RDMA CORE (IB, ROCE, etc.)

  • Disabled device memory if atomics are not available
  • Fixed indirect keys creation for MT registered memory
  • Fixed KSM start address value when creating export key
  • Fixed DCI pool index to support maximum of 16 pools
  • Fixed atomic rkey issue when using imported memory
  • Fixed crash due to unsupported SRQ capability

GPU (CUDA, ROCM)

  • Removed unused environment variable RCACHE_ADDR_ALIGN from ROCm transport
  • Fixed usage of CUDA device 0 when no context is active
  • Removed error handling support from CUDA IPC transport
  • Fixed allocation of unaligned CUDA memory

Shared Memory

  • Fixed occasional crash when shm_unlink fails during interface initialization

UCS

  • Fixed system device distance calculation for devices on different PCIe root
  • Fixed support for large size arrays in ucs_array
  • Fixed synchronization issue in rcache
  • Fixed uninitialized variable access in rcache

Tests

  • Fixed test failures when GPU is present but disabled
  • Fixed Active Message hanging issue in ucp_client_server
  • Fixed potential crash due to redundant munmap call in ucp mmap tests
  • Fixed a crash when running CUDA gtest under Valgrind
  • Fixed UD endpoint timeout issue under Valgrind

Java

  • Fixed failures in Java tests by waiting for send requests completion
  • Fixed JVM segfault in Java tests when gdrcopy driver is not loaded
  • Fixed go build and go tests failures

Packaging

  • Disabled Go bindings in Debian package

v1.17.0 RC3

06 Jun 16:47
770b5a6
Pre-release

1.17.0 RC3 (June 6, 2024)

Bugfixes:

UCP

  • Fixed wire compatibility issue due to mismatched lane selection

UCS

  • Fixed uninitialized variable access in rcache

v1.17.0 RC2

03 Jun 08:10
9cec0d4
Pre-release

1.17.0 RC2 (May 29, 2024)

Features:

UCP

  • Improved the accuracy of rendezvous protocol performance estimation
  • Enabled short protocol for non-host memory types on empty messages
  • Improved the accuracy of performance estimation for empty messages by removing non-relevant overheads
  • Added RMA_ZCOPY_MAX_SEG_SIZE configuration parameter to allow modifying segment size for RMA-ZCOPY protocols
  • Added support for separate intra/inter-node rendezvous thresholds
  • Added support for minimal fragment size in rendezvous protocol
  • Added support for resetting request during send operation
  • Added UCX_PROTO_OVERHEAD configuration variable to allow setting protocol overheads
  • Improved performance for combined Active Message/RMA scenarios by separating them to different lanes
  • Added support for device staging buffers in pipeline protocols
  • Enabled on-demand paging for Nvidia's Grace platforms by default

RDMA CORE (IB, ROCE, etc.)

  • Introduced the UCX_REVERSE_SL environment variable to configure reverse SL for DC transport. By default, it uses UCX_IB_SL.
  • Added support for GID auto-detection in Floating LID based routing
  • Added support for multithreading KSM registration of unaligned buffers
  • Added IB_SEND_OVERHEAD and MM_[SEND|RECV]_OVERHEAD configuration variables

GPU (CUDA, ROCM)

  • Added support for oneAPI Level-Zero library for Intel GPUs

UCS

  • Added support for rcache dynamic region alignment
  • Added dynamic bitmap data structure
  • Added support for advanced key-value parsing for UCX configuration
  • Added piecewise linear function data structure
  • Added support for allocating dynamic arrays on stack

Tools

  • Added support for device memory allocation in UCX perftest
  • Added a script to use for squashing commits after PR approval
  • Added support for DPU cross-gvmi daemon in UCX perftest

Java

  • Added support for EP local socket address API in JUCX

Build

  • Added address sanitizer support
  • Added a helper shell script to run static checks

AZP

  • Replaced Valgrind tests with address sanitizer tool
  • Added Ubuntu 22.04 docker image testing

Configuration

  • Added support for filtering configuration sections by platform type
  • Added configuration file with section for Grace Hopper

Bugfixes:

UCP

  • Fixed crash due to incorrect lane selection when active message is disabled
  • Fixed RMA lane selection issue due to wrong bandwidth calculation
  • Fixed rendezvous protocol information in protocol details table
  • Fixed endpoint reconfiguration issue due to wrong bandwidth calculation
  • Fixed Active Message handlers issue due to out of order registration
  • Fixed registration of memh evens for imported memory key
  • Fixed sockaddr unreachable destination error handling
  • Fixed uninitialized memory issue in new protocols infrastructure
  • Fixed race condition when using strong fence by flushing all endpoints
  • Fixed incorrect RMA message size on immediate completion with no datatype
  • Fixed incorrect performance estimation due to fp8 pack/unpack issue
  • Fixed remote access error when rcache memory is not registered with atomic access
  • Fixed assertion failure when rcache fails during memh allocation
  • Fixed atomic device selection issue
  • Fixed worker interface deactivation while still in use by endpoints

RDMA CORE (IB, ROCE, etc.)

  • Disabled device memory if atomics are not available
  • Fixed indirect keys creation for MT registered memory
  • Fixed KSM start address value when creating export key
  • Fixed DCI pool index to support maximum of 16 pools
  • Fixed atomic rkey issue when using imported memory
  • Fixed crash due to unsupported SRQ capability

GPU (CUDA, ROCM)

  • Removed unused environment variable RCACHE_ADDR_ALIGN from ROCm transport
  • Fixed usage of CUDA device 0 when no context is active
  • Removed error handling support from CUDA IPC transport
  • Fixed allocation of unaligned CUDA memory

Shared Memory

  • Fixed occasional crash when shm_unlink fails during interface initialization

UCS

  • Fixed system device distance calculation for devices on different PCIe root
  • Fixed support for large size arrays in ucs_array
  • Fixed synchronization issue in rcache

Tests

  • Fixed test failures when GPU is present but disabled
  • Fixed Active Message hanging issue in ucp_client_server
  • Fixed potential crash due to redundant munmap call in ucp mmap tests
  • Fixed a crash when running CUDA gtest under Valgrind
  • Fixed UD endpoint timeout issue under Valgrind

Java

  • Fixed failures in Java tests by waiting for send requests completion
  • Fixed JVM segfault in Java tests when gdrcopy driver is not loaded
  • Fixed go build and go tests failures

Packaging

  • Disabled Go bindings in Debian package

v1.17.0 RC1

16 May 14:25
0233ba6
Pre-release

1.17.0 RC1 (May 16, 2024)

TBD

v1.16.0

16 Apr 13:39
e4bb802

1.16.0 (April 15, 2024)

Features:

UCP

  • Added tag offload rendezvous protocol in new infrastructure
  • Added rcache to old protocols infrastructure
  • Added multi-fragment protocols for stream API in new infrastructure
  • Enabled new protocols infrastructure by default
  • Removed context param from ucp_memh_put
  • Added assertion if trying to register unsupported memory type
  • Adjusted rendezvous latency to improve scalability
  • Improved endpoint configuration logging information
  • Added check for max length of user defined Active Message header
  • Added rcache support for mem type memory registration
  • Enabled error handling for rndv/put_zcopy protocol
  • Enabled v2 as default client/server connection establishment packet version
  • Enabled rendezvous protocol selection for reachable MDs only
  • Added ucp_rkey_compare API to enable rkey comparison
  • Added release version to worker address to enable wire compatibility
  • Added support for memory invalidation for rendezvous through DC transport
  • Enabled the use of strong fence with new protocols infrastructure

UCT

  • Added UCS_MEMORY_TYPE_RDMA memory type for better latency on supported devices
  • Implemented is_reachable_v2 API for IB transport
  • Added ep_is_connected API

RDMA CORE (IB, ROCE, etc.)

  • Added Floating LID (FLID) based routing support
  • Added latency and min_zcopy configuration variables to ROCm-IPC
  • Added support for indirect MR for cross-gvmi mkey instead of direct MR with DEVX UMEM

TCP

  • Added filter to eliminate bridge devices from lane selection

GPU (CUDA, ROCM)

  • Added support for handling memh with multiple registrations
  • Added performance estimation BW based on GPU type
  • Adjusted rocm/ipc latency and zcopy threshold parameters
  • Improved error message when libnvidia-ml not installed
  • Added profiling to CUDA runtime API calls
  • Adjusted gdr_copy estimated BW to improve protocol selection

Shared Memory

  • Adjusted FIFO_SIZE to improve scalability
  • Removed redundant rcache implementation in knem transport
  • Added support for symmetric rkey to improve memory usage

UCS

  • Improved scalability of connection establishment flow
  • Improved memtype cache performance by replacing pthread lock with spinlock
  • Added support for VLAN over channel bonding interface
  • Added LRU cache and Usage Tracker data structures
  • Improved cross-NUMA device detection
  • Added support for PCIe gen5 bandwidth detection
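
For readers unfamiliar with the LRU cache mentioned in the list above, here is a minimal sketch of the eviction policy built on Python's `OrderedDict`. It illustrates the general technique only; the UCS data structure is written in C and its interface is not described in these notes, so the class and method names here are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items = OrderedDict()  # oldest entry first

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        """Insert or refresh key; return the evicted key, or None."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self._capacity:
            evicted, _ = self._items.popitem(last=False)
            return evicted
        return None


if __name__ == "__main__":
    cache = LRUCache(2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")            # "a" becomes the most recently used key
    print(cache.put("c", 3))  # evicts the least recently used key
```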

Build

  • Added LCOV coverage report as a build option
  • Added binutils 2.40 library dependencies
  • Added development modulefile

Tools

  • Added information about sizes of ucp_request_t fields in ucx_info
  • Added ucx env to profiling output
  • Added MAD RTE in ucx_perftest to support setups without IPoIB

Tests

  • Added GTEST_LOG_LEVEL env var to set log level just before test run
  • Disabled protov1 and ud_verbs tests for valgrind mode
  • Reduced gtest execution time

Documentation

  • Added a few details to coding style

Bugfixes:

UCP

  • Reverted wireup latency calculation which caused lanes selection issue
  • Fixed strong fence to always ensure ordering
  • Fixed registration of memh for RNDV protocol
  • Fixed rndv_put and rkey_ptr assertion failure
  • Fixed performance estimation for multi-fragment protocols
  • Fixed memory registration error handling
  • Fixed buffer overflow of large log messages
  • Fixed progress enabling for selected lanes
  • Fixed atomic lanes progress enabling
  • Added missing rendezvous schemes to environment variable documentation
  • Fixed bcopy BW estimation for AMD
  • Fixed lanes information printing for new protocols infrastructure
  • Fixed rndv_am protocol thresholds
  • Fixed fp8 packing issue
  • Fixed Intel OneAPI compilation error
  • Fixed CM address packing on server side
  • Fixed endpoint reconfiguration issue due to asymmetrical selection
  • Fixed asymmetrical selection due to wire compatibility issue
  • Fixed potential deadlock with cuda_copy and RTR protocol
  • Fixed tag_recv return value on immediate completion
  • Fixed memory corruption by proper memh handling in tag offload rendezvous
  • Changed default allocator to not use reserved huge pages
  • Fixed rndv put protocol to avoid early completion
  • Fixed rndv_put transport selection for device to device scenario
  • Disabled rendezvous pipeline protocol selection when using non-contiguous buffer
  • Fixed crash in rendezvous protocol rkey pack after failed memory registration

RDMA CORE (IB, ROCE, etc.)

  • Fixed compilation failure when DevX is explicitly disabled
  • Fixed crash when using PCIe relaxed ordering
  • Fixed remote access error with rc_verbs transport
  • Fixed endpoint address management in unified mode
  • Fixed assertion failure when configured with UCX_IB_ADDR_TYPE=ib_global
  • Fixed overwritten MD attribute capabilities when querying a device
  • Fixed ibv_reg_mr error by registering memory in rcache callback
  • Disabled MR multithreading registration
  • Fixed mlx5 WQE posting error due to compiler memory copy optimizations

TCP

  • Fixed asymmetric lanes selection issue due to inconsistent device listing

GPU (CUDA, ROCM)

  • Fixed compilation flags to support ROCm 6.0
  • Fixed values of D2H_THRESH and latency params
  • Fixed CUDA memory support for iov datatype
  • Increased max number of agents in ROCm
  • Fixed cuda_ipc transport being disabled if a CUDA device is not set during initialization

Shared Memory

  • Fixed posix and cma transport selection by enhancing reachability checks
  • Fixed UGNI build failure
  • Fixed latency overhead for knem and cma transports
  • Fixed possible out-of-order issue in mm_iface

UCS

  • Fixed a deadlock when forked debugger is attached during an error in rcache operation
  • Fixed crash due to passing null pointer to log function
  • Fixed crash due to incorrect hashing method
  • Fixed crash in configuration parser cleanup by moving it after profiler cleanup
  • Fixed floating point division by zero during protocols initialization

UCM

  • Fixed occasional crash in bistro hooks by adding a lock before hooking
  • Fixed compilation error when building on PPC64

Java

  • Fixed go tests by setting CUDA device before allocating CUDA memory
  • Fixed perftest error detection and hanging issue

Tools

  • Fixed cpu model type for AMD Genoa in ucx_info
  • Enhanced multi-thread test output

Build

  • Fixed JUCX package publishing, so it will include support for ARM
  • Fixed ROCm building and testing
  • Removed libnvidia-compute version dependency
  • Removed libibmad/libumad from default build configuration to avoid runtime dependency

Packaging

  • Fixed already existing target error when using cmake find_package(ucx) twice

v1.16.0 RC5

03 Apr 10:56
e20264e
Pre-release

1.16.0 RC5 (April 02, 2024)

Features:

UCS

  • Added support for PCIe gen5 bandwidth detection
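
The note above does not say how the bandwidth is detected. As background only, the theoretical per-direction PCIe link bandwidth follows from the per-lane transfer rate and the line encoding, which the sketch below computes. The function name is hypothetical, and the result ignores packet and protocol overhead, so real throughput is lower.

```python
# Per-lane transfer rate in GT/s and line encoding (payload bits, total bits)
# for each PCIe generation.
PCIE_GENS = {
    1: (2.5, 8, 10),
    2: (5.0, 8, 10),
    3: (8.0, 128, 130),
    4: (16.0, 128, 130),
    5: (32.0, 128, 130),
}


def pcie_bandwidth_gbytes(gen, lanes):
    """Theoretical one-direction link bandwidth in GB/s."""
    rate, payload_bits, total_bits = PCIE_GENS[gen]
    # GT/s per lane * lanes * encoding efficiency, converted from bits to bytes.
    return rate * lanes * payload_bits / total_bits / 8


if __name__ == "__main__":
    for gen in (3, 4, 5):
        print(f"gen{gen} x16: {pcie_bandwidth_gbytes(gen, 16):.2f} GB/s")
```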

Bugfixes:

UCP

  • Fixed rndv_put transport selection for device to device scenario

RDMA CORE (IB, ROCE, etc.)

  • Disabled MR multithreading registration