Ensemble Machine Learning and Synthetic Data Augmentation for Reliable Collision Prediction in Chaotic Advection
POSTER
Abstract
Chaotic advection refers to the chaotic, seemingly random trajectories that fluid particles can follow even within a simple, deterministic flow field. Because of this non-linear behavior, predicting and validating chaotic advection models is computationally expensive.
This research aims to develop a machine learning (ML) model, trained on data generated from the Aref model (Aref, 1984), to predict collisions between fluid particles in a chaotic advection flow. The primary challenge is the highly imbalanced dataset: non-collision instances vastly outnumber collision instances, leading to poor predictive performance on the minority class. To address this, we employ the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic collision instances and Edited Nearest Neighbors (ENN) to enhance data quality by removing noisy samples. With the rebalanced data, we establish a baseline using logistic regression. We then explore decision trees to capture non-linear relationships, random forests to reduce overfitting, and finally gradient boosting to incrementally improve accuracy by correcting previous errors. Even with these models, individual performance may remain suboptimal given the inherent complexity of chaotic advection. Hence, we adopt a voting classifier that integrates the logistic regression, decision tree, random forest, and gradient boosting models. This ensemble approach leverages the strengths of each model, aiming for improved predictive performance and robustness.
Presenters
-
Barath Sundaravadivelan
Arizona State University
Authors
-
Barath Sundaravadivelan
Arizona State University
-
Alberto Scotti
Arizona State University