Sample Size Determination for Machine Learning Surrogates of Molecular Dynamics Simulations
ORAL
Abstract
Machine learning (ML) surrogates of molecular dynamics (MD) simulations of soft materials promise significant performance gains, but these gains generally come at the cost of acquiring large training datasets to learn the complex relationships between input soft material attributes and output properties. When high-performance computing resources are limited, selecting which simulations to run in order to build a representative training dataset becomes paramount. Using a well-trained ML surrogate, based on an artificial neural network, for MD simulations of confined electrolytes, we explore the possibility of balancing surrogate accuracy against the cost of training. We investigate how performance metrics such as accuracy, mean-squared error, and speed-up depend on the training data volume. We show that decreasing the training dataset size by a factor of 8 lowers the surrogate accuracy by about 4%. We determine the relative importance of the different input features and introduce a sample size reduction strategy that further reduces the training set size while maintaining the desired levels of accuracy and robustness in surrogate predictions. We also examine the link between the uncertainties present in the ground truth data and the surrogate performance.
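The dependence of test error on training data volume described in the abstract can be illustrated with a learning curve. The following is a minimal sketch only, not the authors' method: the ground truth function, the noise level, the polynomial least-squares fit (a cheap stand-in for the artificial neural network surrogate), and all names such as `synthetic_ground_truth` and `learning_curve` are assumptions introduced for illustration.

```python
import numpy as np


def synthetic_ground_truth(x):
    # Hypothetical stand-in for an MD-derived output property (assumption).
    return np.sin(2.0 * x[:, 0]) + 0.5 * x[:, 1] ** 2


def polynomial_features(x, degree):
    # All monomials x0^a * x1^b with a + b <= degree, for two input features.
    cols = [np.ones(len(x))]
    for d in range(1, degree + 1):
        for i in range(d + 1):
            cols.append(x[:, 0] ** (d - i) * x[:, 1] ** i)
    return np.column_stack(cols)


def fit_surrogate(x, y, degree=3):
    # Least-squares polynomial fit; a placeholder for the neural network.
    coef, *_ = np.linalg.lstsq(polynomial_features(x, degree), y, rcond=None)
    return coef


def learning_curve(sizes, n_test=500, noise=0.05, degree=3, seed=0):
    # Train on nested subsets of one noisy pool and score each fit
    # against noise-free test targets, returning {n_train: test MSE}.
    rng = np.random.default_rng(seed)
    n_max = max(sizes)
    x_pool = rng.uniform(-1.0, 1.0, size=(n_max, 2))
    y_pool = synthetic_ground_truth(x_pool) + rng.normal(0.0, noise, n_max)
    x_test = rng.uniform(-1.0, 1.0, size=(n_test, 2))
    y_test = synthetic_ground_truth(x_test)
    results = {}
    for n in sizes:
        coef = fit_surrogate(x_pool[:n], y_pool[:n], degree)
        pred = polynomial_features(x_test, degree) @ coef
        results[n] = float(np.mean((pred - y_test) ** 2))
    return results


if __name__ == "__main__":
    for n, mse in learning_curve([20, 40, 80, 160, 320, 640]).items():
        print(f"n_train={n:4d}  test MSE={mse:.5f}")
```

Using nested subsets of a single pool means each smaller training set is contained in the larger ones, so differences in error reflect the data volume rather than a change in which samples were drawn. The `noise` parameter gives a rough handle on how uncertainty in the ground truth data propagates to the fitted surrogate.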
Presenters
- Fanbo Sun, Indiana University Bloomington
Authors
- Fanbo Sun, Indiana University Bloomington
- Kadupitiya JCS, Microsoft
- Vikram Jadhao, Indiana University Bloomington