Ask Mister Assessment Person:
How Do You Estimate
the Reliability of
Ronald A. Berk
This chapter inaugurates a new series in the National Evaluation Systems, Inc., proceedings of its annual October-testfest. It seemed appropriate to begin the "Ask Mister Assessment Person" series with the year 2000 book because I could not think of any other gimmick to lure you into reading my semitechnical thinking on the topic of reliability.
The Q & A format provides a direct, no-nonsense mechanism for addressing the most important, practical, probing, interesting, and fun issues on a totally boring, snor-o-matic subject. You are probably thinking, "Who is Ron to set himself up as an assessment authority as lofty as Mister Assessment Person?" Hmmm. Actually, this guy is not the authority you assume. He will simply give answers to the most frequently asked technical questions, and—consistent with his previous writings—if he does not know the "correct" answer, he will probably just make one up. By the way, why in the world am I writing about myself in third person?
Simply because I hope that you will find this format, and the information, useful and entertaining.
Ronald A. Berk is Professor and Assistant Dean for Teaching in the School of Nursing at The Johns Hopkins University in Baltimore, Maryland.
Mister Assessment Person's Top 10 questions and answers on reliability, listed below, address changes in teacher licensure/certification test development practices, technical issues in reliability, and the Standards for Educational and Psychological Testing that span more than the last half of this century. Some of the information may be review, some of it may be new, but all of it is intended to give you a clue about the direction in which we are heading into the next millennium.
Q1: Why is the Kuder-Richardson Formula 20 estimate of reliability reported for most published tests?
A1: Great question! There are several reasons: (1) It has withstood the test of time—63 years; (2) it measures an important characteristic of the test scores—internal consistency; (3) it is practical to estimate based on only a single test administration; (4) it is relatively simple to compute and interpret; and (5) it is appropriate for the intended score uses of most published tests, which are norm referenced.
Q2: What does the K-R20 mean?
A2: I do not have a clue. Just kidding. It is an index of the internal consistency reliability of the scores derived from the a
The three components that directly affect the magnitude of the index are the number of items ($k$), item variances ($pq$), and the test variance ($s_x^2$).
Generally, long tests of 50–100 items with difficult items (.40 ≤ p ≤ .70) of moderate-to-high item discrimination (item-to-total score correlation r ≥ .30) yield high test variance with scores approximating a normal distribution. Tests with these characteristics measuring a single content area or construct usually possess very high internal consistency as indicated by K-R20s in the upper .80s to upper .90s. The coefficient estimates the extent to which the items consistently measure one construct (i.e., whether the test is unidimensional).
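For readers who like to see the arithmetic, here is a minimal sketch of how the three components combine. It assumes the usual form of the K-R20, k/(k − 1) × (1 − Σpq / s_x²), with the test variance computed by dividing by the number of examinees so that it matches the pq item variances. The function name `kr20` and the tiny data set are hypothetical, made up purely for illustration.

```python
def kr20(responses):
    """K-R20 for a list of examinee rows, each a list of 0/1 item scores."""
    n = len(responses)       # number of examinees
    k = len(responses[0])    # number of items

    # Item difficulty p = proportion answering the item correctly; q = 1 - p.
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)

    # Variance of the total scores (dividing by n, to match the pq variances).
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical data: 5 examinees (rows) by 4 dichotomous items (columns).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(scores), 2))  # 0.8
```

In this toy example the item variances sum to .80 and the test variance is 2.0, so the coefficient is (4/3)(1 − .40) = .80. Shrink the test variance while holding the item variances fixed and the coefficient drops.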
Q3: How is Cronbach's alpha coefficient different from the K-R20?
A3: Where are you getting these questions? Oops! Sorry, I am not supposed to answer a question with a question.
The easiest way to understand the difference between alpha and the K-R20 is to inspect the formulas:

$$\text{K-R20} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s_x^2}\right)$$

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)$$

What is different? "The symbol α." Good, but that is just the Greek letter alpha. Look further to the right.
Now do you see it? "Yup. It is $s_i^2$." You seem confident now. Is that your final answer? "Not really. Can I use one of my lifelines, Rege?" No. You don't have any lifelines. This is a book chapter, and my name is not Rege. "Okay, then that is my final answer."
The $s_i^2$ is the only element that differs from the $pq$ in the K-R20 formula. The former is simply a generic symbol for the variance of any item, regardless of numerical value; the latter is the variance for an item scored 1 or 0. Alpha can be computed from dichotomous items (1, 0) or polytomous items (5–0), such as constructed
These desirable statistical characteristics of norm-referenced tests are neither desirable for nor characteristic of teacher licensure and certification tests. The distribution of performance is usually not normal because the majority of candidates should be above the passing score. Item difficulties are usually .70 and higher, and item discrimination indices may be below .30. Test length is usually shorter due to a combination of multiple-choice and constructed-response formats. This profile of item statistics and test characteristics tends to produce comparatively lower test score variance and, consequently, lower K-R20 coefficients for the multiple-choice items. It is not unusual for coefficients on these tests to peak in the mid .80s. Probably the only strategy one could use to increase the K-R20 would be to add 10, 20, or more moderately difficult items, which may not be consistent with either the test content specifications or administration time limits.
Conceptual perspective. Conceptually, the K-R20 estimates the internal consistency of individual test scores.
The individual score is the point of decision making in norm-referenced tests. In licensure/certification tests, the decision point is the passing score. The most appropriate reliability evidence is the consistency of pass-fail classification decisions across parallel test administrations.
Although in neither case is there a single, preferred approach to quantify reliability, nor is there any index that adequately covers all relevant sources of measurement error, the conceptual distinction between the types of reliability evidence for individual score decisions and pass-fail decisions is clearly addressed in the 1999 edition of the Standards for Educational and Psychological Testing (Joint Committee on Standards of American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME], 1999):
Standard 2.1 For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant reliabilities and standard errors of measurement or test information functions should be reported.
Comment: For all scores to be interpreted, users should be supplied with reliability data in enough detail to judge whether scores are precise enough for the user's intended interpretations. (p. 31)
Standard 2.14 Conditional standard errors of measurement should be reported at several score levels if constancy cannot be assumed. Where cutscores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cutscore.
Comment: Estimation of conditional standard errors is usually feasible even with the sample sizes that are typically used for reliability analyses. If it is assumed that the standard error is constant over a broad range of score levels, the rationale for this assumption should be presented. (p. 35)
Standard 2.15 When a test or combination of measures is used to make categorical decisions, estimates should be provided of the percentage of examinees who would be classified in the same way on two applications of the procedure, using the same form or alternate forms of the instrument.
Comment: When a test or composite is used to make categorical decisions, such as pass-fail, the standard error of measurement at or near the cutscore has important implications for the trustworthiness of these decisions. (p. 35)
Standard 14.15 Estimates of the reliability of test-based credentialing decisions should be provided.
Comment: The standards for decision reliability described in chapter 2 [2.14 and 2.15 above] are applicable to tests used for licensure and certification. Other types of reliability estimates and associated standard errors of measurement may also be useful, but the reliability of the decision of whether or not to certify is of primary importance. (p. 162)
Standards 2.1 and 2.15 draw the strongest contrast between the types of decisions to be made. Although the K-R20 can provide useful information about the reliability of scores on a teacher licensure test, an index of the consistency of pass-fail decisions is more appropriate, meaningful, and essential evidence of reliability.
These standards indicate that different estimates of reliability and standard errors of measurement are required for these tests. However, the choice of estimation techniques and the minimum acceptable level for any index remain matters of professional judgment.
ness of a decision consistency index for licensure/certification tests. We hope that Standards 2.15 and 14.15 will have a greater impact on practice.
Q6: What determines the type(s) of reliability evidence for teacher licensure/certification tests?
A6: It is the sources of measurement error that can contaminate the scores and, ultimately, the dependability of the decisions based on those scores. There are at least three forms of evidence: (1) decision consistency based on the cutscore, (2) standard error of measurement of the scores near the cutscore, and (3) interjudge reliability of subjectively scored item formats.
Q7: How do you compute the consistency of pass-fail decisions for a teacher licensure/certification test?
$$p_o = \frac{n_{pp} + n_{ff}}{N},$$
which means count the number of pass-pass ($n_{pp}$) and fail-fail ($n_{ff}$) examinees and divide by the total $N$.
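The counting recipe just described can be sketched in a few lines. This is only an illustration under stated assumptions: each candidate has scores from two administrations (or two forms), a candidate passes with a score at or above the cutscore, and the function name `decision_consistency`, the scores, and the cutscore of 70 are all hypothetical.

```python
def decision_consistency(scores_a, scores_b, cutscore):
    """Proportion of examinees classified the same way (pass-pass or
    fail-fail) on two administrations, i.e., (n_pp + n_ff) / N."""
    n = len(scores_a)
    pairs = list(zip(scores_a, scores_b))
    # Assumption: a score at or above the cutscore counts as a pass.
    n_pp = sum(1 for a, b in pairs if a >= cutscore and b >= cutscore)
    n_ff = sum(1 for a, b in pairs if a < cutscore and b < cutscore)
    return (n_pp + n_ff) / n


# Hypothetical scores for 10 candidates on two administrations.
first = [82, 75, 68, 90, 71, 64, 77, 85, 69, 73]
second = [80, 72, 71, 88, 66, 60, 79, 83, 72, 70]

p_o = decision_consistency(first, second, cutscore=70)
print(p_o)  # 0.7
```

Here six candidates pass both times and one fails both times, so 7 of the 10 decisions are consistent.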
An index published one year later by Swaminathan, Hambleton, and Algina (1974) was the kappa coefficient or, as it was spoken in the Acropolis, κ. This index is
$$\kappa = \frac{p_o - p_c}{1 - p_c},$$
or the ratio of the difference between the proportion of classification agreement observed and chance agreement ($p_o - p_c$) to the difference between the maximum proportion of classification agreement, which is 1, and chance agreement ($1 - p_c$).
The κ statistic was actually created 14 years earlier by Cohen (1960) as a generalized proportion agreement index, frequently used to estimate interjudge agreement. The statistical properties and extension of κ to weighted κ have been described by Cohen (1968) and Fleiss, Cohen, and Everitt (1969). Swaminathan et al. (1974) just applied κ to estimate the decision consistency of criterion-referenced tests using a pass-fail cutscore.
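As a rough sketch of the κ computation for pass-fail decisions, the snippet below works from the four cells of a 2 × 2 classification table. It assumes chance agreement is the sum of the products of the marginal pass and fail proportions, as in Cohen's (1960) definition; the function name `kappa_pass_fail` and the cell counts are hypothetical.

```python
def kappa_pass_fail(n_pp, n_pf, n_fp, n_ff):
    """Kappa for a 2 x 2 pass-fail table from two administrations.

    n_pf = passed the first administration and failed the second;
    n_fp = failed the first and passed the second.
    """
    n = n_pp + n_pf + n_fp + n_ff

    # Observed proportion of consistent classifications.
    p_o = (n_pp + n_ff) / n

    # Chance agreement from the marginal pass and fail proportions.
    pass_a, pass_b = (n_pp + n_pf) / n, (n_pp + n_fp) / n
    p_c = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)

    return (p_o - p_c) / (1 - p_c)


# Hypothetical table for 10 candidates: 6 pass-pass, 1 pass-fail,
# 2 fail-pass, 1 fail-fail.
kappa = kappa_pass_fail(6, 1, 2, 1)
print(round(kappa, 2))  # 0.21
```

With these counts the observed agreement is .70 but chance agreement is .62, so κ is only about .21. The same data can therefore look respectable or dismal depending on which index is reported.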
The debate over the advantages and disadvantages of using po, κ, or both remains unresolved. The technical issues have been discussed at length (and ad nauseam) by Mister Assessment Person elsewhere (see Berk, 1980, 1984). The bottom line: po is an unbiased estimate of decision consistency that is simple to compute, interpret, and explain; κ is a biased estimate with a long list of limitations and statistical conditions that complicate its interpretation. In other words, use po. It would be check marked as a Best Buy in Consumer Reports.
Q8: What trends in teacher licensure/certification testing over the past 30 years affect the way reliability is conceptualized and estimated?
A8: There have been at least four trends or characteristics related to (1) test structure and item format, (2) setting the passing score, (3) scoring the test, and (4) the feasibility of two test administrations to the same candidates.
Test structure. For nearly two millennia until the 1990s, the tests covered both pedagogy and subject area sections in multiple-choice format. In the 1990s, coterminous with the performance-assessment movement, teacher licensure tests shifted from 100% multiple-choice format toward a combination of multiple-choice and constructed-response (usually a writing sample) or performance-assessment formats.
This trend by both major test publishers—National Evaluation Systems, Inc., and the other one—had a directional impact on test structure: The number of multiple-choice items was reduced to permit time for the writing sample. This would affect the setting of cutscores, test scoring, and reliability, particularly in the next few paragraphs.
Passing score. How do you set a passing score with two different test sections, different item formats, and different numbers of points? I know I am not supposed to raise questions in this answer section, but I got carried away with my excitement over this topic; plus the question seemed more dramatic than simply answering the question. Seriously, there are two basic choices: (1) set a cutscore based on a total score or weighted total score of the combined sections (compensatory scoring) or (2) set a separate cutscore for each section and then set another standard for how many sections must be passed (conjunctive scoring).
The latter is often used when there are several performance-assessment exercises to be passed rather than only multiple-choice and writing sections. Both of these strategies as well as a combination of the two have been used with various licensure/certification tests. The compensatory model has been the most frequent choice for teacher licensure tests.
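The difference between the two models is easy to see with one candidate. The sketch below is illustrative only: the function names, section weights, and cutscores are hypothetical, and a section is assumed to be passed when the score is at or above its cutscore.

```python
def passes_compensatory(section_scores, weights, total_cutscore):
    """Pass if the weighted total across sections reaches the cutscore,
    so a strong section can make up for a weak one."""
    total = sum(w * s for w, s in zip(weights, section_scores))
    return total >= total_cutscore


def passes_conjunctive(section_scores, section_cutscores, min_sections):
    """Pass if at least min_sections sections meet their own cutscores."""
    passed = sum(
        1 for s, c in zip(section_scores, section_cutscores) if s >= c
    )
    return passed >= min_sections


# Hypothetical candidate: 48 of 60 multiple-choice points and a writing
# sample scored 2 on a 5-0 rubric.
candidate = [48, 2]

# Compensatory: the writing score is weighted by 4 and the total
# cutscore is 55.
print(passes_compensatory(candidate, weights=[1, 4], total_cutscore=55))  # True

# Conjunctive: cutscores of 40 and 3, and both sections must be passed.
print(passes_conjunctive(candidate, [40, 3], min_sections=2))  # False
```

The same candidate passes under the compensatory model (48 + 8 = 56) and fails under the conjunctive model because of the writing section alone, which is why the choice of model matters for any estimate of decision consistency.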
Test scoring. As if these scoring models and methods for setting cutscores are not complicated enough, another issue is the difference in score ranges for different item formats. Multiple-choice items are right or wrong, scored dichotomously as 1, 0; constructed-response and performance-assessment formats use holistic or analytic scoring rubrics or benchmarks, scored polytomously, such as 5–0. Remember how the K-R20 was restricted to dichotomous items and coefficient alpha eliminated that restriction? "No!" Well, maybe you should go back to questions 2 and 3 and review the answers before passing "Go" and collecting $200.
Anyway, an analogous problem exists in computing a decision-consistency estimate. Both po and κ assume dichotomously scored items. Then how do you estimate these indices for a constructed-response item?
Stay tuned to the answer to the next question.