Publication Date: 2016-Feb-22
Mode Comparability Study Based on Spring 2015 Operational Test Data
Liu, Junhui; Brown, Terran; Chen, Jianshen; Ali, Usama; Hou, Likun; Costanzo, Kate
Partnership for Assessment of Readiness for College and Careers
The Partnership for Assessment of Readiness for College and Careers (PARCC) is a state-led consortium working to develop next-generation assessments that measure student progress toward college and career readiness more accurately than previous assessments. The PARCC assessments include both English Language Arts/Literacy (ELA/L) and mathematics assessments in grades 3 to 8 and high school. This mode comparability study was conducted to address the following two questions: (1) Is the construct invariant between the two modes of test administration, computer-based testing (CBT) and paper-based testing (PBT)? and (2) Given that the construct remains the same, is student performance (e.g., mean, median, various quartiles) similar between the two modes? To address these two research questions, a series of analyses was conducted using data from the spring 2015 operational tests of mathematics grades 5 and 7, Algebra I, Geometry, Algebra II, and ELA/L grades 3, 7, and 9. Because school districts selected the test administration mode, the resulting CBT and PBT test-taker groups are not randomly equivalent. The following analyses were conducted for this mode comparability study: (1) Z-score comparisons (Section 3.2) to evaluate the similarity of item performance of the common items across modes; (2) Differential item functioning (DIF; Section 3.3) to identify common items with differences in performance once test takers are matched on ability; (3) Comparison of item response theory (IRT) item parameter estimates (Sections 5 and 7) to evaluate the similarity of item difficulty estimates and item discrimination parameter estimates based on separate within-mode IRT calibrations; and (4) Summary test statistics (Section 6) to compare "test-level" mean performance across modes. This analysis included effect sizes to determine the magnitude of possible mode effects. The item-level analyses showed that the differences in item difficulties were small for the majority of items. 
However, the Prose Constructed Response (PCR) trait items in ELA/L had larger differences in item difficulties than other item types, and all of these differences favored PBT. The difficulties of the common items were strongly correlated between modes in nearly all subjects and grade levels, indicating coherence in measuring the same construct. Although a very small percentage of items was identified as having substantial differences across the two modes after accounting for test-taker ability, many items were flagged for moderate differences favoring PBT for ELA/L grades 3, 7, and 9 as well as for the Geometry test; the majority of these items in ELA/L were PCR trait items. Additional analyses were conducted on student data from the sole state (State S) that provided prior state assessment scores. Prior achievement data were used for adjustment to make the CBT and PBT groups more comparable. The scale score differences were largely reduced for mathematics grades 5 and 7 and Algebra I after adjusting for prior achievement, and the scale scores were generally comparable across modes for these tests. However, for other tests, particularly ELA/L grade 9 and Geometry, there were substantial differences in scores across modes.
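
The abstract mentions effect sizes for gauging the magnitude of mode effects but does not say which statistic was used. As a minimal sketch, assuming a standardized mean difference (Cohen's d with a pooled standard deviation), the computation could look like the following. The function name and the score lists are hypothetical and invented for illustration; they are not data from the study.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference between two groups (pooled SD).

    Assumption: this is one common effect-size measure, not necessarily
    the one used in the study.
    """
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical scale scores, made up for illustration only.
cbt_scores = [740, 752, 731, 765, 748]
pbt_scores = [745, 760, 738, 770, 751]

# A negative value means the second group (here PBT) has the higher mean.
print(round(cohens_d(cbt_scores, pbt_scores), 3))
```

Because the CBT and PBT groups in the study were not randomly equivalent, a raw difference like this would mix mode effects with group differences, which is why the adjustment for prior achievement matters when interpreting it.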
