Sample Size Determination for Machine Learning Surrogates of Molecular Dynamics Simulations

ORAL

Abstract

Machine learning (ML) surrogates of molecular dynamics (MD) simulations of soft materials promise significant performance gains, but they generally require large training datasets to learn the complex relationships between input material attributes and output properties. When high-performance computing resources are limited, selecting simulations that yield a representative training dataset becomes paramount. Using a well-trained, artificial-neural-network-based ML surrogate for MD simulations of confined electrolytes, we explore how to balance surrogate accuracy against the cost of training. We investigate how performance metrics such as accuracy, mean-squared error, and speed-up depend on the training data volume. We show that reducing the training dataset size by a factor of 8 lowers the surrogate accuracy by ~4%. We determine the relative importance of the different input features and introduce a sample size reduction strategy that further reduces the training set size while maintaining the desired levels of accuracy and robustness in surrogate predictions. We also examine the link between the uncertainties in the ground truth data and the surrogate performance.
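
The dependence of test error on training set size described above can be illustrated with a learning-curve experiment. The following is a minimal sketch only, not the authors' method: it assumes synthetic data in place of MD simulation outputs and swaps the neural network surrogate for a closed-form ridge regression so that it runs with NumPy alone. All function names, the feature expansion, and the dataset sizes are hypothetical.

```python
import numpy as np

def make_synthetic_data(n, rng):
    """Hypothetical stand-in for MD input/output pairs (not real simulation data)."""
    X = rng.uniform(-1.0, 1.0, size=(n, 5))
    y = np.sin(2.0 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * X[:, 2]
    y = y + rng.normal(0.0, 0.05, size=n)  # noise mimicking ground-truth uncertainty
    return X, y

def features(X):
    """Quadratic feature expansion with a bias column."""
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression (assumed stand-in for the ANN surrogate)."""
    Phi = features(X)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

def mse(w, X, y):
    """Mean-squared error of the fitted model on (X, y)."""
    return float(np.mean((features(X) @ w - y) ** 2))

def learning_curve(sizes, n_test=2000, seed=0):
    """Train on random subsets of decreasing size; evaluate on a fixed test set."""
    rng = np.random.default_rng(seed)
    X_train, y_train = make_synthetic_data(max(sizes), rng)
    X_test, y_test = make_synthetic_data(n_test, rng)
    results = {}
    for n in sizes:
        idx = rng.choice(len(X_train), size=n, replace=False)
        w = fit_ridge(X_train[idx], y_train[idx])
        results[n] = mse(w, X_test, y_test)
    return results

# Sizes span an 8x reduction, echoing the factor quoted in the abstract.
curve = learning_curve([4000, 2000, 1000, 500])
for n, err in curve.items():
    print(f"n_train={n:5d}  test MSE={err:.4f}")
```

Holding the test set fixed while shrinking the training subset isolates the effect of training data volume; in practice the subsets would be drawn from completed simulations and the model would be the actual surrogate.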

Presenters

  • Fanbo Sun

    Indiana University Bloomington

Authors

  • Fanbo Sun

    Indiana University Bloomington

  • Kadupitiya JCS

    Microsoft

  • Vikram Jadhao

    Indiana University Bloomington