A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees

Danielle M. Rodgers* [Contact author]
Arizona State University, Tempe, AZ 85281, USA

Ross Jacobucci
University of Notre Dame, Notre Dame, IN 46556, USA

Kevin J. Grimm
Arizona State University, Tempe, AZ 85281, USA

Abstract: Decision trees (DTs) are a machine learning technique that searches the predictor space for the variable and observed value that yield the best prediction when the data are split into two nodes on that variable and splitting value. The algorithm repeats its search within each partition of the data until a stopping rule ends the search. Missing data can be problematic in DTs because an observation with a missing value on the chosen splitting variable cannot be placed into a node. Moreover, missing data can alter the split selection process itself, because the algorithm cannot place observations with missing values. Simple missing data approaches (e.g., listwise deletion, majority rule, and surrogate splits) have been implemented in DT algorithms; however, more sophisticated missing data techniques have not been thoroughly examined. We propose a modified multiple imputation approach to handling missing data in DTs and compare it, via Monte Carlo simulation, with simple missing data approaches as well as single imputation and multiple imputation with prediction averaging. The study evaluated the performance of each missing data approach when data were missing at random (MAR) or missing completely at random (MCAR). The proposed multiple imputation approach and surrogate splits outperformed the other approaches, with the proposed multiple imputation approach performing best in the more severe missing data conditions. We conclude with recommendations for handling missing data in DTs.
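
The split search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a continuous outcome, complete data, and a sum-of-squared-errors criterion, and the names `sse` and `best_split` are hypothetical helpers introduced only for this example.

```python
from typing import List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(
    X: List[List[float]], y: List[float]
) -> Optional[Tuple[int, float, float]]:
    """Exhaustively search every predictor and observed value for a binary split.

    Returns (predictor_index, split_value, total_sse) for the split that
    minimizes the summed SSE of the two child nodes, or None if no split
    is possible. Assumption: complete data, continuous outcome.
    """
    best: Optional[Tuple[int, float, float]] = None
    for j in range(len(X[0])):
        observed = sorted({row[j] for row in X})
        # The largest observed value is skipped: it would leave the right node empty.
        for t in observed[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


if __name__ == "__main__":
    # Toy data (illustrative only): predictor 0 separates the outcome perfectly.
    X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
    y = [1.0, 1.0, 5.0, 5.0]
    print(best_split(X, y))
```

In this sketch, the comparisons `row[j] <= t` and `row[j] > t` are undefined for a row with no value on predictor `j`, which is the placement problem that the missing data approaches compared in the study are meant to address.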

Keywords: Multiple Imputation • Classification and Regression Tree (CART) • Missing Data

DOI: https://doi.org/10.35566/jbds/v1n1/p6

Fulltext: Read online

PDF: v1n1p6.pdf

Citation: (APA style) Rodgers, D. M., Jacobucci, R., & Grimm, K. J. (2021). A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees. Journal of Behavioral Data Science, 1(1), 127–153. https://doi.org/10.35566/jbds/v1n1/p6

BibTeX format:

  author    = {Danielle M. Rodgers and Ross Jacobucci and Kevin J. Grimm},
  journal   = {Journal of Behavioral Data Science},
  title     = {A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees},
  year      = {2021},
  month     = {may},
  number    = {1},
  pages     = {127--153},
  volume    = {1},
  doi       = {10.35566/jbds/v1n1/p6},
  publisher = {International Society for Data Science and Analytics},
