A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees
Danielle M. Rodgers* [Contact author]
Arizona State University, Tempe, AZ 85281, USA
dmrodge3@asu.edu
Ross Jacobucci
University of Notre Dame, Notre Dame, IN 46556, USA
rjacobuc@nd.edu
Kevin J. Grimm
Arizona State University, Tempe, AZ 85281, USA
kjgrimm@asu.edu
Abstract: Decision trees (DTs) are a machine learning technique that searches the predictor space for the variable and observed value that yield the best prediction when the data are split into two nodes on that variable and value. The algorithm repeats this search within each partition of the data until a stopping rule ends it. Missing data can be problematic in DTs because an observation with a missing value on the chosen splitting variable cannot be placed into a node. Moreover, missing data can alter the split selection process, because observations with missing values cannot be placed when candidate splits are evaluated. Simple missing data approaches (e.g., listwise deletion, majority rule, and surrogate splits) have been implemented in DT algorithms; however, more sophisticated missing data techniques have not been thoroughly examined. We propose a modified multiple imputation approach to handling missing data in DTs and, in a Monte Carlo simulation, compare it with simple missing data approaches as well as with single imputation and multiple imputation with prediction averaging. The simulation evaluated the performance of each missing data approach when data were missing at random (MAR) or missing completely at random (MCAR). The proposed multiple imputation approach and surrogate splits performed best overall, with the proposed approach performing best in the more severe missing data conditions. We conclude with recommendations for handling missing data in DTs.
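The split search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation or their proposed imputation approach; it assumes a regression outcome, a sum-of-squared-errors criterion, and `None` as the missing-value marker, and the names `sse` and `best_split` are hypothetical. It shows only the simplest handling, in which observations with a missing value on a candidate variable are left out when that variable is evaluated.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Search every variable and observed value for the two-node split
    with the smallest total SSE.

    X is a list of rows; None marks a missing value (an assumption of
    this sketch). Returns (variable_index, threshold, total_sse), or
    None if no split is possible.
    """
    best = None
    for j in range(len(X[0])):
        # Rows with a missing value on variable j cannot be placed in
        # either node, so they are dropped for this variable only.
        # Variables are therefore scored on different subsets of the
        # data, which is one way missingness can alter split selection.
        observed = [(row[j], t) for row, t in zip(X, y) if row[j] is not None]
        candidates = sorted({v for v, _ in observed})
        # The largest observed value would leave the right node empty.
        for threshold in candidates[:-1]:
            left = [t for v, t in observed if v <= threshold]
            right = [t for v, t in observed if v > threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


# Toy data: the second predictor has two missing values.
X = [[1, 5], [2, None], [3, 7], [4, 6], [5, None], [6, 8]]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
variable, threshold, total_sse = best_split(X, y)
```

On this toy data the search selects the first predictor with a threshold of 3. A full tree would repeat the same search within each resulting node until a stopping rule applies.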
Keywords: Multiple Imputation • Classification and Regression Tree (CART) • Missing Data
DOI: https://doi.org/10.35566/jbds/v1n1/p6
Fulltext: Read online
PDF: v1n1p6.pdf
Citation: (APA style) Rodgers, D. M., Jacobucci, R., & Grimm, K. J. (2021). A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees. Journal of Behavioral Data Science, 1(1), 127–153. https://doi.org/10.35566/jbds/v1n1/p6
BibTeX format:
@Article{Rodgers2021,
  author    = {Danielle M. Rodgers and Ross Jacobucci and Kevin J. Grimm},
  journal   = {Journal of Behavioral Data Science},
  title     = {A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees},
  year      = {2021},
  month     = {may},
  number    = {1},
  pages     = {127--153},
  volume    = {1},
  doi       = {10.35566/jbds/v1n1/p6},
  publisher = {International Society for Data Science and Analytics},
}