Efficient, lossless compression of atomistic datasets with information theory
ORAL
Abstract
Machine learning potentials have been increasingly used to predict potential energy surfaces of atomistic systems with high accuracy and efficiency. However, while larger datasets often lead to improved performance, training models on them incurs substantial computational cost. Ideally, we want algorithms that compress datasets to reduce training times without compromising model accuracy. Here, we describe an algorithm that compresses atomistic datasets based on information theory. First, we present the theoretical foundation behind the algorithm and show how it compresses datasets more effectively than other widely used methods. Then, by testing model performance on datasets outside the distribution of the training data, we show that our approach systematically leads to richer datasets and models with higher generalization power. The work is distributed as part of the QUESTS package, allowing efficient compression of atomistic datasets.
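The abstract does not spell out the selection procedure, but the core idea of information-driven compression can be illustrated with a minimal, hypothetical sketch: greedily retain the atomic environments that are most novel in descriptor space, a simple proxy for maximizing the information content of the kept subset. This is not the QUESTS implementation; the function name and descriptor input are illustrative assumptions.

```python
import numpy as np


def compress_dataset(descriptors: np.ndarray, n_keep: int) -> np.ndarray:
    """Greedily select a diverse subset of atomic-environment descriptors.

    Hypothetical sketch (not the QUESTS algorithm): each step keeps the
    environment farthest, in descriptor space, from everything already
    selected, a crude proxy for maximizing the information entropy of
    the retained subset.
    """
    selected = [0]  # seed the subset with the first environment
    # minimum distance from every point to the current selected set
    dmin = np.linalg.norm(descriptors - descriptors[0], axis=1)
    while len(selected) < n_keep:
        idx = int(np.argmax(dmin))  # most novel remaining environment
        selected.append(idx)
        dnew = np.linalg.norm(descriptors - descriptors[idx], axis=1)
        dmin = np.minimum(dmin, dnew)
    return np.array(selected)


# Example: compress 200 random 8-dimensional descriptors down to 20
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
kept = compress_dataset(X, 20)
```

In practice, an entropy-based criterion (as in the work above) would score candidate subsets by an explicit information measure rather than raw distances, but the greedy "keep the most novel sample" loop conveys the flavor of the approach.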
Presenters
-
Benjamin YU
University of California, Los Angeles
Authors
-
Benjamin YU
University of California, Los Angeles
-
Daniel Schwalbe-Koda
UCLA