Evaluating Equating Results in the Non-Equivalent Groups with Anchor Test Design Using Equipercentile and Equity Criteria.

Duong, Minh Quang

Testing programs often use multiple forms of the same test to control item exposure and to ensure test security. Although test forms are constructed to be as similar as possible, they often differ. Test equating techniques are statistical methods used to adjust scores obtained on different forms of the same test so that the scores are comparable and can be used interchangeably. In this study, the performance of four commonly used equating methods under the non-equivalent groups with anchor test (NEAT) design--the frequency estimation equipercentile method (FE), the chain equipercentile method (CE), the item response theory (IRT) true score method (TS), and the IRT observed score method (OS)--was examined. To evaluate the equating results, four evaluation criteria--the equipercentile criterion (EP), the full equity criterion (E), the first-order equity criterion (E₁), and the second-order equity criterion (E₂)--were used. Simulated data were generated under various conditions of form and group differences. Several major findings were obtained. When the distributions used to simulate ability for the groups were equal, the four methods produced similar results, regardless of the criterion used. When group differences existed in the distributions used to simulate the data, the results produced by the different methods diverged significantly under the EP, E, and E₁ criteria. The difference was small under the E₂ criterion. In general, the OS method outperformed the others with regard to the EP and E criteria. The TS method performed best with regard to the E₁ criterion, followed by the OS, CE, and FE methods. Between the two observed score methods (i.e., FE and CE), which were outperformed by the two IRT methods, the CE method produced much better results, which were close to those produced by the two IRT methods.
The FE method produced the worst results, regardless of the criterion used. It was also found that test form differences had clear effects on all methods, regardless of the criterion used. Larger differences between test forms led to worse equating results. While the two IRT methods were not clearly affected by group differences in the generating distributions, the two observed score equating methods were. Larger group differences led to worse equating results for the CE and FE methods. In addition, the impact of group differences was much stronger for the FE method than for the CE method. Group and form interaction effects were not found for the IRT methods. They were, however, present for the FE and CE methods, although those effects were small. When evaluated with the E₂ criterion, the four equating methods produced results that were no better than those obtained by using raw scores directly from the test forms without equating. These results are discussed in more detail, and some recommendations are made for equating practice. Limitations of the study and suggestions for further research are also presented. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by Telephone 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
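
As an illustration of the equipercentile idea that underlies the FE and CE methods and the EP criterion named in the abstract, the following is a minimal sketch of equipercentile equating for integer scores. It is not the author's procedure: it assumes two score distributions that are already comparable (as in a random groups design), whereas the FE and CE methods add steps that use the anchor test to estimate those distributions under the NEAT design. The function names are hypothetical, percentile ranks use the usual midpoint convention for discrete scores, and every score point is assumed to have a nonzero frequency on both forms.

```python
from itertools import accumulate


def cumulative(freqs):
    """Cumulative proportions for integer scores 0..K given their frequencies."""
    total = sum(freqs)
    return [c / total for c in accumulate(freqs)]


def percentile_rank(freqs, x):
    """Percentile rank (0-100) of integer score x, using the midpoint convention."""
    cum = cumulative(freqs)
    below = cum[x - 1] if x > 0 else 0.0
    return 100.0 * (below + (cum[x] - below) / 2.0)


def equipercentile_equate(freqs_x, freqs_y):
    """Map each integer score on form X to the form Y score with the same percentile rank.

    Assumes every score point on both forms has a nonzero frequency.
    """
    cum_y = cumulative(freqs_y)
    equated = []
    for x in range(len(freqs_x)):
        p = percentile_rank(freqs_x, x) / 100.0
        # Smallest form Y score whose cumulative proportion exceeds p.
        y_upper = next(y for y, c in enumerate(cum_y) if c > p)
        below = cum_y[y_upper - 1] if y_upper > 0 else 0.0
        # Linear interpolation within the interval [y_upper - 0.5, y_upper + 0.5].
        equated.append(y_upper - 0.5 + (p - below) / (cum_y[y_upper] - below))
    return equated


if __name__ == "__main__":
    # Hypothetical frequencies for scores 0..4 on form X and form Y.
    form_x = [2, 5, 8, 4, 1]
    form_y = [1, 3, 7, 6, 3]
    for score, equivalent in enumerate(equipercentile_equate(form_x, form_y)):
        print(score, round(equivalent, 2))
```

When the two forms have identical score distributions, the mapping reduces to the identity, which is a quick sanity check on any implementation of this kind.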