Extreme-scale computing for pseudo-spectral codes using GPUs and fine-grained asynchronism, with application to turbulence

ORAL

Abstract

As computing advances to the pre-Exascale era dominated by accelerators such as Graphics Processing Units, a substantial re-thinking is necessary for many communication-intensive applications, including turbulence simulations based on pseudo-spectral methods. We have developed an asynchronous algorithm with one-dimensional domain decomposition optimized for machines with large CPU memory and fast GPUs, in particular Summit at the Oak Ridge National Laboratory, which consists of IBM Power9 CPUs and NVIDIA V100 GPUs. Data located in the CPU memory are processed in a fine-grained (batched) manner by overlapping high-bandwidth NVLink transfers with fast GPU computations and the high-bandwidth system interconnect, allowing a much larger problem to be run than the much smaller GPU memory might suggest. Pinned memory and zero-copy approaches are used to transfer strided data between the GPU and CPU, achieving high NVLink throughput. Several advanced communication protocols are explored in order to obtain maximum network throughput for collective communication. Benchmarks at the scale of $12288^3$ grid points on 1024 Summit nodes show good weak scaling, with a speedup of over 3X compared to the multi-threaded CPU-only algorithm.
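The overlap strategy in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the batch count, buffer sizes, and the `process` kernel are hypothetical stand-ins. It shows the general pattern of staging data that resides in large CPU (pinned) memory through small device buffers in batches, using multiple CUDA streams so NVLink transfers overlap with kernel execution.

```cuda
// Hypothetical sketch of fine-grained batched processing with
// transfer/compute overlap, assuming data too large for GPU memory
// lives in pinned host memory. Sizes and kernel are illustrative only.
#include <cuda_runtime.h>
#include <cstddef>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  // stand-in for the real per-batch computation
}

int main() {
    const int nBatches = 8;            // illustrative batch count
    const int batchElems = 1 << 20;    // illustrative batch size
    const int nStreams = 2;            // double buffering

    // Pinned host allocation: required for asynchronous DMA over NVLink.
    float *host;
    cudaHostAlloc(&host, (size_t)nBatches * batchElems * sizeof(float),
                  cudaHostAllocDefault);

    float *dev[nStreams];
    cudaStream_t s[nStreams];
    for (int j = 0; j < nStreams; ++j) {
        cudaMalloc(&dev[j], batchElems * sizeof(float));
        cudaStreamCreate(&s[j]);
    }

    // Rotate over streams: while stream 0's batch computes,
    // stream 1's batch is in flight over NVLink, and vice versa.
    for (int b = 0; b < nBatches; ++b) {
        int j = b % nStreams;
        float *h = host + (size_t)b * batchElems;
        cudaMemcpyAsync(dev[j], h, batchElems * sizeof(float),
                        cudaMemcpyHostToDevice, s[j]);
        process<<<(batchElems + 255) / 256, 256, 0, s[j]>>>(dev[j], batchElems);
        cudaMemcpyAsync(h, dev[j], batchElems * sizeof(float),
                        cudaMemcpyDeviceToHost, s[j]);
    }
    cudaDeviceSynchronize();

    for (int j = 0; j < nStreams; ++j) {
        cudaFree(dev[j]);
        cudaStreamDestroy(s[j]);
    }
    cudaFreeHost(host);
    return 0;
}
```

With double buffering, GPU memory holds only two batch-sized buffers regardless of total problem size, which is what lets a problem much larger than GPU memory be processed.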

Presenters

  • Kiran Ravikumar

    Georgia Tech

Authors

  • Kiran Ravikumar

    Georgia Tech

  • David Appelhans

    IBM Research

  • Pui-Kuen Yeung

    Georgia Institute of Technology, Atlanta, USA