Propensity Score Estimation with Data Mining Techniques: Alternatives to Logistic Regression.

Keller, Bryan S. B.; Kim, Jee-Seon; Steiner, Peter M.

Peer reviewed

ERIC Number: ED564102

Record Type: Non-Journal

Publication Date: 2013

Pages: 7

Abstractor: ERIC

ISBN: N/A

ISSN: N/A

EISSN: N/A

Society for Research on Educational Effectiveness

Propensity score analysis (PSA) is a methodological technique that may correct for selection bias in a quasi-experiment by modeling the selection process using observed covariates. Because logistic regression is well understood by researchers in a variety of fields and easy to implement in a number of popular software packages, it has traditionally been the most frequently used method for modeling selection in PSA. There are, however, circumstances under which logistic regression may not perform well. The most important disadvantage of propensity score (PS) estimation via logistic regression is the need for iterative specification of the model, which can be time intensive and comes with no guarantee of success, particularly when there are many covariates. A careful review of the burgeoning PS estimation literature has shown that the neural network and the support vector machine (SVM) are promising alternatives to logistic regression: they avoid the need for respecification because they automatically model nonlinearities in the selection response surface, and they are well suited to high-dimensional data. These two methods, although promising, have so far been largely or completely untested empirically in this context. Through simulation, this study examines the conditions under which logistic regression is relatively robust to model misspecification and the conditions under which the neural network or the SVM provides a less biased estimate of the treatment effect. The authors also evaluate through simulation, and make available, a program written in R that carries out a cross-validated grid search for the optimal tuning parameters of the data mining methods, selecting them to maximize covariate balance rather than to minimize prediction error.
The results of the simulation study show that misspecifying the PS model in logistic regression can lead to gross bias in the estimate of the treatment effect when there are nonlinear or nonadditive confounders. In that case, the data mining techniques were less biased and had smaller mean squared error. The simulation study further explores how the number of covariates and the number and strength of higher order confounders affect the performance of the PS estimation methods. The authors provide recommendations based on the simulation results to help researchers make informed decisions about which PS estimation technique to use in a given situation, in order to maximize the accuracy and efficiency of research. A table is appended.
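
The idea of tuning a PS model for balance rather than for prediction error can be illustrated with a small sketch. This is not the authors' R program. It is a toy example under stated assumptions: a single covariate whose square drives selection, a Gaussian kernel smoother standing in for the neural network or SVM, the kernel bandwidth as the only tuning parameter, and a plain (not cross-validated) grid search. All function names (simulate, kernel_ps, weighted_smd, imbalance, grid_search) are hypothetical.

```python
import math
import random


def simulate(n, seed=0):
    """Toy data: selection depends on x**2, a nonlinear confounder (assumption)."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(x * x - 1.0)))  # true propensity score
        xs.append(x)
        zs.append(1 if rng.random() < p else 0)
    return xs, zs


def kernel_ps(xs, zs, bandwidth):
    """Kernel-smoothed estimate of P(Z = 1 | x); a stand-in for a data mining model."""
    ps = []
    for x0 in xs:
        w = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
        p = sum(wi * z for wi, z in zip(w, zs)) / sum(w)
        ps.append(min(max(p, 0.01), 0.99))  # clip to keep weights finite
    return ps


def weighted_smd(vals, zs, ps):
    """Absolute standardized mean difference after inverse-probability weighting."""
    w = [1.0 / p if z == 1 else 1.0 / (1.0 - p) for z, p in zip(zs, ps)]

    def wmean(group):
        num = sum(wi * v for wi, v, z in zip(w, vals, zs) if z == group)
        den = sum(wi for wi, z in zip(w, zs) if z == group)
        return num / den

    mean = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / (len(vals) - 1))
    return abs(wmean(1) - wmean(0)) / sd


def imbalance(xs, zs, ps):
    """Worst weighted imbalance over x and x**2 (smaller means better balance)."""
    squares = [x * x for x in xs]
    return max(weighted_smd(xs, zs, ps), weighted_smd(squares, zs, ps))


def grid_search(xs, zs, grid):
    """Pick the bandwidth whose estimated PS gives the best covariate balance."""
    scores = {h: imbalance(xs, zs, kernel_ps(xs, zs, h)) for h in grid}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    xs, zs = simulate(300, seed=1)
    best, scores = grid_search(xs, zs, [0.05, 0.2, 0.5, 1.0, 5.0])
    for h in sorted(scores):
        print(f"bandwidth={h:<5} imbalance={scores[h]:.3f}")
    print("selected bandwidth:", best)
```

The selection criterion here is the balance statistic itself, so a bandwidth that predicts treatment well but leaves x or x**2 unbalanced is not preferred. A fuller version would compute the balance statistic on held-out folds, as the cross-validated search described in the abstract does.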

Descriptors: Probability, Scores, Statistical Analysis, Statistical Bias, Quasiexperimental Design, Regression (Statistics), Mathematical Models, Research Methodology, Evaluation Methods, Simulation, Error of Measurement, Accuracy

Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; Fax: 202-640-4401; e-mail: inquiries@sree.org; Web site: http://www.sree.org

Publication Type: Reports - Research

Education Level: N/A

Audience: N/A

Language: English

Sponsor: N/A

Authoring Institution: Society for Research on Educational Effectiveness (SREE)

Grant or Contract Numbers: N/A
