ERIC Number: ED564102
Record Type: Non-Journal
Publication Date: 2013
Reference Count: 9
Propensity Score Estimation with Data Mining Techniques: Alternatives to Logistic Regression
Keller, Bryan S. B.; Kim, Jee-Seon; Steiner, Peter M.
Society for Research on Educational Effectiveness
Propensity score analysis (PSA) is a methodological technique that can correct for selection bias in a quasi-experiment by modeling the selection process with observed covariates. Because logistic regression is well understood by researchers in a variety of fields and easy to implement in a number of popular software packages, it has traditionally been the most frequently used method for modeling selection in PSA. There are, however, circumstances under which logistic regression may not perform well. The most important disadvantage of a propensity score (PS) estimation approach that uses logistic regression is the need for iterative specification of the model, which can be time intensive and comes with no guarantee of success, particularly when there are many covariates. A careful review of the burgeoning PS estimation literature has shown that the neural network and the support vector machine (SVM) are promising alternatives to logistic regression: they avoid the need for respecification because they automatically model nonlinearities in the selection response surface, and they are well suited to high-dimensional data. These two methods, although promising, have so far been largely or completely untested empirically in this context. Through simulation, this study examines the conditions under which logistic regression is relatively robust to model misspecification and the conditions under which the neural network or the support vector machine will provide a less biased estimate of the effect of a treatment. The authors also evaluate through simulation, and make available, a program written in R that carries out a cross-validated grid search for the optimal tuning parameters of the data mining methods, selecting them to maximize covariate balance rather than to minimize prediction error.
The results of the simulation study clearly demonstrate that misspecification of the PS model via logistic regression can lead to gross bias in the estimate of the treatment effect when there are nonlinear or nonadditive confounders. In that case, the data mining techniques were less biased and had smaller mean square error. The simulation study further explores the effect of the number of covariates and the number and strength of higher order confounders on the performance of the PS estimation methods. The authors provide recommendations based on the simulation study results to help researchers make informed decisions about which propensity score estimation technique to use in a given situation in order to maximize the accuracy and efficiency of research. A table is appended.
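The idea of tuning a PS model for balance rather than prediction error can be illustrated with a minimal sketch. This is not the authors' R program. It is a Python illustration under several assumptions: a Gaussian kernel smoother with bandwidth `h` stands in for the neural network or SVM, balance is measured as the absolute standardized mean difference of a single covariate under inverse-probability weights, and the simulated selection model is invented for the example. All function names are hypothetical.

```python
import math
import random

def kernel_ps(train, x, h):
    """Kernel-smoothed estimate of P(T=1 | x); h is the tuning parameter (assumed stand-in model)."""
    num = den = 0.0
    for xi, ti in train:
        w = math.exp(-0.5 * ((x - xi) / h) ** 2)
        num += w * ti
        den += w
    p = num / den if den > 0 else 0.5
    return min(max(p, 0.01), 0.99)  # clip so the inverse-probability weights stay finite

def weighted_smd(xs, ts, ps):
    """Absolute standardized mean difference of x under inverse-probability weights."""
    def wmean(vals, wts):
        return sum(v * w for v, w in zip(vals, wts)) / sum(wts)
    x1 = [x for x, t in zip(xs, ts) if t == 1]
    w1 = [1 / p for p, t in zip(ps, ts) if t == 1]
    x0 = [x for x, t in zip(xs, ts) if t == 0]
    w0 = [1 / (1 - p) for p, t in zip(ps, ts) if t == 0]
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
    return abs(wmean(x1, w1) - wmean(x0, w0)) / sd

def cv_balance(xs, ts, h, k=5):
    """Estimate each unit's PS from the other k-1 folds, then score balance on the full sample."""
    n = len(xs)
    ps = [0.0] * n
    for fold in range(k):
        train = [(xs[i], ts[i]) for i in range(n) if i % k != fold]
        for i in range(fold, n, k):
            ps[i] = kernel_ps(train, xs[i], h)
    return weighted_smd(xs, ts, ps)

def grid_search(xs, ts, grid, k=5):
    """Return the grid value with the smallest imbalance, plus all scores."""
    scores = {h: cv_balance(xs, ts, h, k) for h in grid}
    best = min(scores, key=scores.get)
    return best, scores

# Simulated data with a nonlinear selection process (illustrative assumption).
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(400)]
ts = [1 if random.random() < 1 / (1 + math.exp(-(0.8 * x + 0.6 * x * x - 0.6))) else 0
      for x in xs]
grid = [0.05, 0.2, 0.5, 1.0, 3.0]
best_h, scores = grid_search(xs, ts, grid)
print("bandwidth with best balance:", best_h)
```

The selection criterion here is the imbalance score itself, so a tuning value that predicts treatment assignment slightly worse can still be chosen if it balances the covariate better. A realistic application would average the balance measure over many covariates and possibly their higher-order terms.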
Descriptors: Probability, Scores, Statistical Analysis, Statistical Bias, Quasiexperimental Design, Regression (Statistics), Mathematical Models, Research Methodology, Evaluation Methods, Simulation, Error of Measurement, Accuracy
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; Fax: 202-640-4401; e-mail: firstname.lastname@example.org; Web site: http://www.sree.org
Publication Type: Reports - Research
Education Level: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)