Achieving Optimal VPIC Performance on Several Modern CPU Architectures

William Nystrom; Douglas Jacobsen

POSTER

Abstract

Two significant modifications to the VPIC\footnote{K. J. Bowers, B. J. Albright, L. Yin, B. Bergen, and T. J. T. Kwan, Phys. Plasmas 15, 055703 (2008)} particle advance implementation are being explored. The first is an Array of Structs of Arrays (AoSoA) data structure for the particles, which eliminates the need to transpose particle data after loading it into vector registers and before storing it back to memory. The second is a particle sort performed every timestep, which allows particles to be processed in a double loop over cells and over the particles in each cell. This second modification enables several optimizations, including hoisting the load of interpolation data and the store of current density accumulation data out of the per-cell particle loop. Together, these modifications eliminate a performance bottleneck caused by the shuffle and permute operations in data transposes and increase the efficiency of VPIC's use of available memory bandwidth. Initial results show performance improvements greater than 2x for some of the VPIC particle kernels. Results for the complete implementation will be presented on several modern architectures, including Intel Knights Landing and IBM Power 9.
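The two modifications can be illustrated with a minimal C++ sketch. This is not VPIC's code: the names `ParticleBlock`, `advance_sorted`, and `block_start`, the lane count `W = 8`, and the toy one-component push are all assumptions chosen for illustration. The sketch assumes particles are already sorted by cell and that unused lanes in a block are padded with zero weight.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical lane count; a real code would match the SIMD width.
constexpr std::size_t W = 8;

// One AoSoA block: each field is a contiguous W-wide array, so a single
// vector load fills a register with one field for W particles and no
// transpose (shuffle/permute) is needed.
struct ParticleBlock {
    std::array<float, W> dx;  // position offset within the cell
    std::array<float, W> ux;  // momentum component
    std::array<float, W> w;   // weight (0 for padding lanes)
};

// Toy particle advance over sorted particles (illustrative physics only).
// Blocks of cell c occupy [block_start[c], block_start[c + 1]).
void advance_sorted(std::vector<ParticleBlock>& blocks,
                    const std::vector<std::size_t>& block_start,
                    const std::vector<float>& ex,  // per-cell field data
                    std::vector<float>& jx,        // per-cell current
                    float qdt, float dt) {
    assert(block_start.size() == ex.size() + 1);
    assert(jx.size() == ex.size());
    for (std::size_t c = 0; c < ex.size(); ++c) {
        const float e = ex[c];  // load hoisted out of the particle loop
        float j = 0.0f;         // current accumulated locally
        for (std::size_t b = block_start[c]; b < block_start[c + 1]; ++b) {
            ParticleBlock& p = blocks[b];
            // Unit-stride loop over lanes; compilers can vectorize this.
            for (std::size_t l = 0; l < W; ++l) {
                p.ux[l] += qdt * e;
                p.dx[l] += dt * p.ux[l];
                j += p.w[l] * p.ux[l];
            }
        }
        jx[c] += j;  // store hoisted out of the particle loop
    }
}
```

Because every particle in the inner loops belongs to the same cell, the field value is read once per cell and the current is written once per cell, rather than once per particle.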

Authors

William Nystrom

Los Alamos National Laboratory

Douglas Jacobsen

Intel Corporation