Optimal Data Collection for Machine Learning
ORAL
Abstract
Machine learning refers to a collection of computational techniques for identifying or learning patterns in data. Although existing techniques are most effective on large data sets, there is growing interest in applying them to smaller ones. We consider the application of machine learning to predicting ambient sound levels in the contiguous United States from GIS data. The challenge is the limited availability of training data from which to construct a model: data collection in this case is both costly and time-consuming. This leads us to consider two questions: first, how best to validate a machine learning model with limited training data, and second, whether additional data can measurably improve the accuracy of the model. We create an ensemble of models that perform equally well as measured by leave-one-out cross validation on our initial training set. However, these models give wildly different predictions for areas in the central region of the country. By collecting additional data in cropland areas in Utah, we were able to improve the predictions of our machine learning model for other, geographically similar regions of the country.
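The two quantities the abstract relies on, a leave-one-out cross-validation score and the disagreement among ensemble members at a new location, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three toy one-dimensional regressors, the function names, and the synthetic data are hypothetical and are not the models or measurements used in the study.

```python
import math
import statistics


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    m = statistics.fmean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_nearest(xs, ys):
    """1-nearest-neighbour regressor; returns a predictor."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def loocv_rmse(fit, xs, ys):
    """Leave-one-out CV: refit without point i, then score the model on point i."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = fit(train_x, train_y)
        squared_errors.append((predict(xs[i]) - ys[i]) ** 2)
    return math.sqrt(statistics.fmean(squared_errors))


def ensemble_spread(fits, xs, ys, x_query):
    """Range of predictions across models trained on the full data set."""
    preds = [fit(xs, ys)(x_query) for fit in fits]
    return max(preds) - min(preds)


if __name__ == "__main__":
    # Synthetic toy data (hypothetical, not measurements from the study).
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]
    fits = [fit_mean, fit_linear, fit_nearest]
    for fit in fits:
        print(fit.__name__, "LOOCV RMSE:", loocv_rmse(fit, xs, ys))
    # Spread inside the sampled range versus far outside it.
    print("spread at x=2:", ensemble_spread(fits, xs, ys, 2.0))
    print("spread at x=10:", ensemble_spread(fits, xs, ys, 10.0))
```

In this toy setting the ensemble agrees inside the sampled range and diverges far from it. That is the kind of signal that could suggest where additional data would be most informative, although the study's actual selection procedure is not described in the abstract.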
Presenters
-
Casie Gaza
Brigham Young University
Authors
-
Mark K. Transtrum
Brigham Young Univ - Provo, Brigham Young University
-
Kent L Gee
Brigham Young Univ - Provo, Brigham Young University
-
Katrina L Pedersen
Brigham Young University
-
Brooks A Butler
Brigham Young University
-
Casie Gaza
Brigham Young University