Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark
ACM SIGMETRICS Performance Evaluation Review, 2011, Vol. 38(4), pp. 23–29
Abstract
We present a performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade devices, including the Tesla C2050 built on NVIDIA's Fermi processor. We also utilise recently developed performance models of LU to facilitate a comparison between future large-scale distributed clusters of GPU devices and existing clusters built on traditional CPU architectures, including a quad-socket, quad-core AMD Opteron cluster and an IBM BlueGene/P.