Publication Date
| In 2015 | 1 |
| Since 2014 | 17 |
| Since 2011 (last 5 years) | 113 |
| Since 2006 (last 10 years) | 235 |
| Since 1996 (last 20 years) | 369 |
Descriptor
| Test Theory | 999 |
| Test Construction | 217 |
| Test Items | 207 |
| Test Reliability | 201 |
| Test Validity | 201 |
| Scores | 144 |
| Psychometrics | 143 |
| Higher Education | 133 |
| Mathematical Models | 132 |
| Item Analysis | 126 |
Author
| Mislevy, Robert J. | 21 |
| Zimmerman, Donald W. | 15 |
| van der Linden, Wim J. | 15 |
| Sinharay, Sandip | 9 |
| Haladyna, Tom | 7 |
| Wilcox, Rand R. | 7 |
| Williams, Richard H. | 7 |
| Yen, Wendy M. | 7 |
| Brennan, Robert L. | 6 |
| Huynh, Huynh | 6 |
Showing 1 to 15 of 999 results
Sinharay, Sandip; Haberman, Shelby J. – International Journal of Testing, 2014
Recently there has been an increasing level of interest in subtest scores, or subscores, for their potential diagnostic value. Haberman (2008) suggested a method to determine if a subscore has added value over the total score. Researchers have often been interested in the performance of subgroups--for example, those based on gender or…
Descriptors: Scores, Achievement Tests, Language Tests, English (Second Language)
Sinharay, Sandip – Educational Measurement: Issues and Practice, 2014
Brennan (Brennan, R. L., 2012) noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman (Haberman, S. J., 2008) suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. According to this…
Descriptors: Scores, Test Theory, Test Interpretation
Development of Nonword and Irregular Word Lists for Australian Grade 3 Students Using Rasch Analysis
Callinan, Sarah; Cunningham, Everarda; Theiler, Stephen – Australian Journal of Learning Difficulties, 2014
Many tests used in educational settings to identify learning difficulties endeavour to pick up only the lowest performers. Yet these tests are generally developed within a Classical Test Theory (CTT) paradigm that assumes that data do not have significant skew. Rasch analysis is more tolerant of skew and was used to validate two newly developed…
Descriptors: Foreign Countries, Reading Tests, Item Response Theory, Elementary School Students
Longford, Nicholas T. – Journal of Educational and Behavioral Statistics, 2014
A method for medical screening is adapted to differential item functioning (DIF). Its essential elements are explicit declarations of the level of DIF that is acceptable and of the loss function that quantifies the consequences of the two kinds of inappropriate classification of an item. Instead of a single level and a single function, sets of…
Descriptors: Test Items, Test Bias, Simulation, Hypothesis Testing
Sinharay, Sandip – Journal of Educational Measurement, 2014
Brennan noted that users of test scores often want (indeed, demand) that subscores be reported, along with total test scores, for diagnostic purposes. Haberman suggested a method based on classical test theory (CTT) to determine if subscores have added value over the total score. One way to interpret the method is that a subscore has added value…
Descriptors: Scores, Test Theory, Classification, Cutting Scores
Allalouf, Avi – International Journal of Testing, 2014
The Quality Control (QC) Guidelines are intended to increase the efficiency, precision, and accuracy of the scoring, analysis, and reporting process of testing. The QC Guidelines focus on large-scale testing operations where multiple forms of tests are created for use on set dates. However, they may also be used for a wide variety of other testing…
Descriptors: Quality Control, Scoring, Test Theory, Scores
Fan, Xitao; Sun, Shaojing – Journal of Early Adolescence, 2014
In adolescence research, the treatment of measurement reliability is often fragmented, and it is not always clear how different reliability coefficients are related. We show that generalizability theory (G-theory) is a comprehensive framework of measurement reliability, encompassing all other reliability methods (e.g., Pearson "r,"…
Descriptors: Generalizability Theory, Measurement, Reliability, Correlation
Maydeu-Olivares, Alberto – Measurement: Interdisciplinary Research and Perspectives, 2013
In this rejoinder, Maydeu-Olivares states that, in item response theory (IRT) measurement applications, the application of goodness-of-fit (GOF) methods informs researchers of the discrepancy between the model and the data being fitted (the room for improvement). By routinely reporting the GOF of IRT models, together with the substantive results…
Descriptors: Goodness of Fit, Models, Evaluation Methods, Item Response Theory
Lambert, Matthew C.; Hurley, Kristin Duppong; Tomlinson, M. Michele Athay; Stevens, Amy L. – Child & Youth Care Forum, 2013
Background: A client's motivation to receive services is significantly related to seeking services, remaining in services, and improved outcomes. The Motivation for Youth Treatment Scale (MYTS) is one of the few brief measures used to assess motivation for mental health treatment. Objective: To investigate if the psychometric properties of…
Descriptors: Motivation, Mental Health, Health Services, Access to Health Care
Snyder, Patricia A.; Hemmeter, Mary Louise; Fox, Lise; Bishop, Crystal Crowe; Miller, M. David – Journal of Early Intervention, 2013
Fidelity assessment has received renewed attention in recent years, particularly as distinctions have been made in implementation science between intervention fidelity and implementation fidelity. Considering both types of fidelity has been recommended when developing fidelity instruments. In the present article, we describe development of the…
Descriptors: Fidelity, Psychometrics, Rating Scales, Program Implementation
Holland, Paul W. – Journal of Educational Measurement, 2013
While agreeing with van der Linden (this issue) that test equating needs better theoretical underpinnings, my comments criticize several aspects of his article. His examples are, for the most part, worthless; he does not use well-established terminology correctly; his view of 100 years of attempts to give a theoretical basis for equating is…
Descriptors: Equated Scores, Test Theory, Transformations (Mathematics), Computation
van der Linden, Wim J. – Journal of Educational Measurement, 2013
In spite of all of the technical progress in observed-score equating, several of the more conceptual aspects of the process still are not well understood. As a result, the equating literature struggles with rather complex criteria of equating, lack of a test-theoretic foundation, confusing terminology, and ad hoc analyses. A return to Lord's…
Descriptors: Equated Scores, Statistical Analysis, Computation, Data Collection
van der Linden, Wim J. – Journal of Educational Measurement, 2013
This article is a response to the commentaries on the position paper on observed-score equating by van der Linden (this issue). The response focuses on the more general issues in these commentaries, such as the nature of the observed scores that are equated, the importance of test-theory assumptions in equating, the necessity to use multiple…
Descriptors: Equated Scores, Test Theory, Transformations (Mathematics)
Shahat, Mohamed A.; Ohle, Annika; Treagust, David F.; Fischer, Hans E. – International Journal of Science and Mathematics Education, 2013
Educators and policymakers envision the future of education in Egypt as enabling learners to acquire scientific inquiry and problem-solving skills. In this article, we describe the validation of a model for problem solving and the design of instruments for evaluating new teaching methods in Egyptian science classes. The instruments were based on…
Descriptors: Foreign Countries, Questionnaires, Problem Solving, Science Instruction
Rantanen, Pekka – Assessment & Evaluation in Higher Education, 2013
A multilevel analysis approach was used to analyse students' evaluation of teaching (SET). The low value of inter-rater reliability stresses that any solid conclusions on teaching cannot be made on the basis of single feedbacks. To assess a teacher's general teaching effectiveness, one needs to evaluate four randomly chosen course implementations.…
Descriptors: Test Reliability, Feedback (Response), Generalizability Theory, Student Evaluation of Teacher Performance

