ConnectX-2 CORE-Direct Enabled Asynchronous Broadcast Collective Communications
Top 10% of 2011 papers
Abstract
This paper describes the design and implementation of InfiniBand (IB) CORE-Direct based blocking and nonblocking broadcast operations within the Cheetah collective operation framework. It describes a novel approach that fully offloads collective operations and employs only user-supplied buffers. For a 64-rank communicator, the latency of the CORE-Direct based hierarchical algorithm is better than that of production-grade Message Passing Interface (MPI) implementations: for a one-kilobyte (KB) message it is 150% better than the default Open MPI algorithm and 115% better than the shared-memory optimized MVAPICH implementation, and for an eight-megabyte (MB) message it is 48% and 64% better, respectively. The flat-topology broadcast achieves 99.9% overlap in a polling-based communication-computation test and 95.1% overlap in a wait-based test, compared with 92.4% and 17.0%, respectively, for a similar Central Processing Unit (CPU) based implementation.