Publication Date
| In 2015 | 0 |
| Since 2014 | 0 |
| Since 2011 (last 5 years) | 2 |
| Since 2006 (last 10 years) | 9 |
| Since 1996 (last 20 years) | 28 |
Descriptor
| Scoring | 42 |
| Test Items | 13 |
| Test Construction | 12 |
| Performance Based Assessment | 10 |
| Scores | 10 |
| Elementary Secondary Education | 7 |
| Evaluation Methods | 7 |
| Interrater Reliability | 7 |
| Comparative Analysis | 6 |
| Student Evaluation | 6 |
Source
| Applied Measurement in… | 42 |
Author
| Clauser, Brian E. | 3 |
| Clyman, Stephen G. | 2 |
| Ercikan, Kadriye | 2 |
| Hambleton, Ronald K. | 2 |
| Johnson, Robert L. | 2 |
| Klein, Stephen P. | 2 |
| Margolis, Melissa J. | 2 |
| Sireci, Stephen G. | 2 |
| Attali, Yigal | 1 |
| Becker, Douglas F. | 1 |
Publication Type
| Journal Articles | 42 |
| Reports - Research | 20 |
| Reports - Evaluative | 19 |
| Information Analyses | 3 |
| Speeches/Meeting Papers | 2 |
| Collected Works - General | 1 |
| Reports - Descriptive | 1 |
| Reports - General | 1 |
Education Level
| Grade 5 | 4 |
| Elementary Education | 1 |
| Elementary Secondary Education | 1 |
| Grade 3 | 1 |
| Grade 4 | 1 |
| Grade 6 | 1 |
| Grade 8 | 1 |
Showing 1 to 15 of 42 results
Kachchaf, Rachel; Solano-Flores, Guillermo – Applied Measurement in Education, 2012
We examined how rater language background affects the scoring of short-answer, open-ended test items in the assessment of English language learners (ELLs). Four native English-speaking and four native Spanish-speaking certified bilingual teachers scored 107 responses of fourth- and fifth-grade Spanish-speaking ELLs to mathematics items administered in…
Descriptors: Error of Measurement, English Language Learners, Scoring, Bilingual Teachers
Bridgeman, Brent; Trapani, Catherine; Attali, Yigal – Applied Measurement in Education, 2012
Essay scores generated by machine and by human raters are generally comparable; that is, they can produce scores with similar means and standard deviations, and machine scores generally correlate as highly with human scores as scores from one human correlate with scores from another human. Although human and machine essay scores are highly related…
Descriptors: Scoring, Essay Tests, College Entrance Examinations, High Stakes Tests
Hein, Serge F.; Skaggs, Gary E. – Applied Measurement in Education, 2009
Only a small number of qualitative studies have investigated panelists' experiences during standard-setting activities or the thought processes associated with panelists' actions. This qualitative study involved an examination of the experiences of 11 panelists who participated in a prior, one-day standard-setting meeting in which either the…
Descriptors: Focus Groups, Standard Setting, Cutting Scores, Cognitive Processes
Zhang, Bo; Ohland, Matthew W. – Applied Measurement in Education, 2009
One major challenge in using group projects to assess student learning is accounting for the differences of contribution among group members so that the mark assigned to each individual actually reflects their performance. This research addresses the validity of grading group projects by evaluating different methods that derive individualized…
Descriptors: Monte Carlo Methods, Validity, Student Evaluation, Evaluation Methods
Osborn Popp, Sharon E.; Ryan, Joseph M.; Thompson, Marilyn S. – Applied Measurement in Education, 2009
Scoring rubrics are routinely used to evaluate the quality of writing samples produced for writing performance assessments, with anchor papers chosen to represent score points defined in the rubric. Although the careful selection of anchor papers is associated with best practices for scoring, little research has been conducted on the role of…
Descriptors: Writing Evaluation, Scoring Rubrics, Selection, Scoring
Puhan, Gautam – Applied Measurement in Education, 2009
The purpose of this study is to determine the extent of scale drift on a test that employs cut scores. It was essential to examine scale drift for this testing program because new forms in this testing program are often put on scale through a series of intermediate equatings (known as equating chains). This process may cause equating error to…
Descriptors: Testing Programs, Testing, Measurement Techniques, Item Response Theory
Clauser, Brian E.; Harik, Polina; Margolis, Melissa J.; McManus, I. C.; Mollon, Jennifer; Chis, Liliana; Williams, Simon – Applied Measurement in Education, 2009
Numerous studies have compared the Angoff standard-setting procedure to other standard-setting methods, but relatively few studies have evaluated the procedure based on internal criteria. This study uses a generalizability theory framework to evaluate the stability of the estimated cut score. To provide a measure of internal consistency, this…
Descriptors: Generalizability Theory, Group Discussion, Standard Setting (Scoring), Scoring
McCarty, F. A.; Oshima, T. C.; Raju, Nambury S. – Applied Measurement in Education, 2007
Oshima, Raju, Flowers, and Slinde (1998) described procedures for identifying sources of differential functioning for dichotomous data using differential bundle functioning (DBF) derived from the differential functioning of items and test (DFIT) framework (Raju, van der Linden, & Fleer, 1995). The purpose of this study was to extend the procedures…
Descriptors: Rating Scales, Test Bias, Scoring, Test Items
Hogan, Thomas P.; Murphy, Gavin – Applied Measurement in Education, 2007
We determined the recommendations for preparing and scoring constructed-response (CR) test items in 25 sources (textbooks and chapters) on educational and psychological measurement. The project was similar to Haladyna's (2004) analysis for multiple-choice items. We identified 12 recommendations for preparing CR items given by multiple sources,…
Descriptors: Test Items, Scoring, Test Construction, Educational Indicators
Skorupski, William P.; Hambleton, Ronald K. – Applied Measurement in Education, 2005
Panelists in an operational standard-setting study were asked to share their thoughts in written form at important points in the process itself--before the meeting started, after training, after completing the 1st and 2nd sets of ratings, immediately following the discussion between Rounds 1 and 2, and so on. The item mapping method of Plake and…
Descriptors: Grade 5, Grade 6, Language Tests, Test Items
Williamson, David M.; Bejar, Isaac I.; Sax, Anne – Applied Measurement in Education, 2004
As automated scoring of complex constructed-response examinations reaches operational status, the process of evaluating the quality of resultant scores, particularly in contrast to scores of expert human graders, becomes as complex as the data itself. Using a vignette from the Architectural Registration Examination (ARE), this article explores the…
Descriptors: Validity, Scoring, Scores, Evaluation Methods
Penfield, Randall D.; Miller, Jeffrey M. – Applied Measurement in Education, 2004
Descriptors: Student Evaluation, Evaluation Methods, Content Validity, Scoring
Johnson, Robert L.; Penny, Jim; Fisher, Steve; Kuhs, Therese – Applied Measurement in Education, 2003
When raters assign different scores to a performance task, a method for resolving rating differences is required to report a single score to the examinee. Recent studies indicate that decisions about examinees, such as pass/fail decisions, differ across resolution methods. Previous studies also investigated the interrater reliability of…
Descriptors: Test Reliability, Test Validity, Scores, Interrater Reliability
Keller, Lisa A.; Swaminathan, Hariharan; Sireci, Stephen G. – Applied Measurement in Education, 2003
Evaluated two strategies for scoring context-dependent test items: ignoring the dependence and scoring dichotomously, or modeling the dependence through polytomous scoring. Results for data from 38,965 examinees taking a professional examination show that dichotomous scoring may overestimate test information, but polytomous scoring may underestimate…
Descriptors: Adults, Licensing Examinations (Professions), Scoring, Test Items
Zenisky, April L.; Sireci, Stephen G. – Applied Measurement in Education, 2002
Reviews and illustrates some of the current technological developments in computer-based testing, focusing on novel item formats and automated scoring methodologies. The review shows a number of innovations being researched and implemented. (SLD)
Descriptors: Educational Innovation, Educational Technology, Elementary Secondary Education, Large Scale Assessment