Peer reviewed
ERIC Number: EJ1132360
Record Type: Journal
Publication Date: 2017
Pages: 12
Abstractor: As Provided
ISSN: 0895-7347
Statistically Comparing the Performance of Multiple Automated Raters across Multiple Items
Kieftenbeld, Vincent; Boyer, Michelle
Applied Measurement in Education, v30 n2 p117-128 2017
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to random sampling of items and/or responses in the validation sets. Any statistical hypothesis test of the differences in rankings needs to be appropriate for use with rater statistics and adjust for multiple comparisons. This study considered different statistical methods to evaluate differences in performance across multiple raters and items. These methods are illustrated leveraging data from the 2012 Automated Scoring Assessment Prize competitions. Using average rankings to test for significant differences in performance between automated and human raters, findings show that most automated raters did not perform statistically significantly different from human-to-human inter-rater agreement for essays but they did perform differently on short-answer items. Differences in average rankings between most automated raters were not statistically significant, even when their observed performance differed substantially.
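As a rough illustration of what "using average rankings" can look like, the sketch below ranks raters within each item, averages those ranks across items, and computes a Friedman-type omnibus statistic. The abstract does not name the test the authors used, so the Friedman statistic here is an assumption. The function names (`rank_row`, `average_ranks`, `friedman_statistic`) and the agreement values in `table` are hypothetical, not data from the study. No tie correction or multiple-comparison adjustment is applied.

```python
from statistics import mean


def rank_row(scores):
    """Rank one item's scores: 1 = best (highest); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the block
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def average_ranks(table):
    """Average each rater's within-item rank across all items (rows)."""
    rows = [rank_row(r) for r in table]
    k = len(table[0])
    return [mean(row[j] for row in rows) for j in range(k)]


def friedman_statistic(table):
    """Friedman chi-square statistic (no tie correction) for n items, k raters."""
    n, k = len(table), len(table[0])
    avg = average_ranks(table)
    return 12 * n / (k * (k + 1)) * sum((r - (k + 1) / 2) ** 2 for r in avg)


# Hypothetical agreement values: rows are items, columns are raters.
table = [
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.71],
    [0.79, 0.81, 0.69],
    [0.85, 0.80, 0.74],
]

print(average_ranks(table))       # [1.25, 1.75, 3.0]
print(friedman_statistic(table))  # 6.5
```

Under the null hypothesis of no difference between raters, this statistic is approximately chi-square distributed with k - 1 degrees of freedom. Pairwise follow-up comparisons would still need an adjustment for multiple comparisons, as the abstract notes.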
Routledge. Available from: Taylor & Francis, Ltd. 530 Walnut Street Suite 850, Philadelphia, PA 19106. Tel: 800-354-1420; Tel: 215-625-8900; Fax: 215-207-0050; Web site:
Publication Type: Journal Articles; Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A