Towards exascale multiphase compressible flow simulation via scalable interface capturing-based solvers and GPU acceleration
POSTER
Abstract
OLCF Frontier broke the exascale barrier in June 2022.
One must judiciously offload and carefully implement CFD algorithms to effectively use exascale resources for flow simulations.
We present a strategy that brings us closer to utilizing these resources in the context of multiphase compressible flows.
All algorithms are implemented in the open-source Multicomponent Flow Code (MFC).
MFC uses a finite volume method with WENO5-based interface capturing and the HLLC Riemann solver.
We offload all compute kernels to GPUs via OpenACC and argue that similar explicit methods must follow this strategy on current accelerators.
The kernels are fine-tuned via OpenACC directives and, where appropriate, thread serialization.
This results in high arithmetic intensity, realizing about 55% of the peak GPU FLOPs.
We observe a 500-times speed-up on an NVIDIA A100 over a single core of a modern Intel CPU.
This corresponds to about a 50-times speed-up over a CPU node for a GPU node in a modern supercomputer (e.g., SDSC Expanse).
This implementation demonstrates ideal weak scaling to at least 13824 GPUs on OLCF Summit.
CUDA-aware MPI enables remote direct data access, improving strong scaling behavior.
We also probe performance across various CPU architectures, including ARM, x86, and IBM Power.
Presenters
-
Anand Radhakrishnan
Georgia Tech
Authors
-
Anand Radhakrishnan
Georgia Tech
-
Henry Le Berre
Georgia Tech
-
Spencer H Bryngelson
Georgia Institute of Technology