Ask Mister Assessment Person: How Do You Estimate the Reliability of Teacher Licensure/Certification Tests?

Ronald A. Berk

This chapter inaugurates a new series in the National Evaluation Systems, Inc., proceedings of its annual Octobertestfest. It seemed appropriate to begin the "Ask Mister Assessment Person" series with the year 2000 book because I could not think of any other gimmick to lure you into reading my semitechnical thinking on the topic of reliability.

The Q & A format provides a direct, no-nonsense mechanism for addressing the most important, practical, probing, interesting, and fun issues on a totally boring, snor-o-matic subject. You are probably thinking, "Who is Ron to set himself up as an assessment authority as lofty as Mister Assessment Person?" Hmmm. Actually, this guy is not the authority you assume. He will simply give answers to the most frequently asked technical questions, and—consistent with his previous writings—if he does not know the "correct" answer, he will probably just make one up. By the way, why in the world am I writing about myself in third person?

Simply because I hope that you will find this format, and the information, useful and entertaining.

Ronald A. Berk is Professor and Assistant Dean for Teaching in the School of Nursing at The Johns Hopkins University in Baltimore, Maryland.

Mister Assessment Person's Top 10 questions and answers on reliability, listed below, address changes in teacher licensure/certification test development practices, technical issues in reliability, and the Standards for Educational and Psychological Testing that span more than the last half of this century. Some of the information may be review, some of it may be new, but all of it is intended to give you a clue about the direction in which we are heading into the next millennium.

Q1: Why is the Kuder-Richardson Formula 20 estimate of reliability reported for most published tests?

A1: Great question! There are several reasons: (1) It has withstood the test of time—63 years; (2) it measures an important characteristic of the test scores—internal consistency; (3) it is practical to estimate based on only a single test administration; (4) it is relatively simple to compute and interpret; and (5) it is appropriate for the intended score uses of most published tests, which are norm referenced.

Q2: What does the K-R20 mean?

A2: I do not have a clue. Just kidding. It is an index of the internal consistency reliability of the scores derived from the a

–  –  –

The three components that directly affect the magnitude of the index are the number of items ($k$), item variances ($pq$), and the test variance ($s_x^2$).

Generally, long tests of 50–100 items with difficult items (.40 ≤ p ≤ .70) of moderate-to-high item discrimination (item-to-total score correlation r ≥ .30) yield high test variance with scores approximating a normal distribution. Tests with these characteristics measuring a single content area or construct usually possess very high internal consistency as indicated by K-R20s in the upper .80s to upper .90s. The coefficient estimates the extent to which the items consistently measure one construct (i.e., whether the test is unidimensional).
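As a minimal sketch of how the three components combine, the standard K-R20 formula, k/(k − 1) × (1 − Σpq / s_x²), can be computed as shown here. The function name `kr20`, the tiny data set `scores`, and the use of the population form of the variance are assumptions made for illustration only; they are not taken from the chapter.

```python
from statistics import pvariance


def kr20(responses):
    """K-R20 for a list of examinee rows, each a list of 0/1 item scores."""
    n = len(responses)          # number of examinees
    k = len(responses[0])       # number of items
    # p = proportion answering each item correctly; q = 1 - p
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Variance of total scores (population form; an assumption of this sketch)
    var_x = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / var_x)


# Hypothetical data: 5 examinees, 4 dichotomously scored items
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 3))  # 0.8
```

The sketch fails with a division by zero when every examinee earns the same total score, which mirrors the point above: with no test variance, there is nothing for the index to work with.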

Q3: How is Cronbach's alpha coefficient different from the K-R20?

A3: Where are you getting these questions? Oops! Sorry, I am not supposed to answer a question with a question.

The easiest way to understand the difference between alpha and the K-R20 is to inspect the formulas:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)$$

What is different? "The symbol α." Good, but that is just the Greek letter alpha. Look further to the right.

Now do you see it? "Yup. It is $s_i^2$." You seem confident now. Is that your final answer? "Not really. Can I use one of my lifelines, Rege?" No. You don't have any lifelines. This is a book chapter, and my name is not Rege. "Okay, then that is my final answer."

The $s_i^2$ is the only element that differs from the $pq$ in the K-R20 formula. The former is simply a generic symbol for the variance of any item, regardless of numerical value; the latter is the variance for an item scored 1 or 0. Alpha can be computed from dichotomous items (1, 0) or polytomous items (5–0), such as constructed

–  –  –
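To make the relationship between alpha and the K-R20 concrete, here is a small sketch of coefficient alpha computed from item variances. The function name `cronbach_alpha`, both data sets, and the choice of population variances are assumptions for illustration. With that variance convention, the variance of a 1/0 item equals pq, so alpha on dichotomous data matches the K-R20.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient alpha for examinee rows of item scores (any numeric scale)."""
    k = len(responses[0])
    # s_i^2: variance of each item across examinees
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    # s_x^2: variance of the total scores
    var_x = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / var_x)


# Hypothetical dichotomous items (1, 0): alpha reduces to the K-R20
dichotomous = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Hypothetical polytomous items scored 5-0, as with a scoring rubric
polytomous = [
    [5, 4, 5],
    [4, 4, 3],
    [3, 2, 3],
    [1, 2, 1],
    [0, 1, 0],
]
alpha_dichotomous = cronbach_alpha(dichotomous)
alpha_polytomous = cronbach_alpha(polytomous)
print(round(alpha_dichotomous, 3))  # 0.8
print(round(alpha_polytomous, 3))
```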

These desirable statistical characteristics of norm-referenced tests are neither desirable for nor characteristic of teacher licensure and certification tests. The distribution of performance is usually not normal because the majority of candidates should be above the passing score. Item difficulties are usually .70 and higher, and item discrimination indices may be below .30. Test length is usually shorter due to a combination of multiple-choice and constructed-response formats. This profile of item statistics and test characteristics tends to produce comparatively lower test score variance and, consequently, lower K-R20 coefficients for the multiple-choice items. It is not unusual for coefficients on these tests to peak in the mid-.80s. Probably the only strategy one could use to increase the K-R20 would be to add 10, 20, or more moderately difficult items, which may not be consistent with either the test content specifications or administration time limits.

Conceptual perspective. Conceptually, the K-R20 estimates the internal consistency of individual test scores.

The individual score is the point of decision making in norm-referenced tests. In licensure/certification tests, the decision point is the passing score. The most appropriate reliability evidence is the consistency of pass-fail classification decisions across parallel test administrations.

Although in neither case is there a single, preferred approach to quantify reliability, nor is there any index that adequately covers all relevant sources of measurement error, the conceptual distinction between the types of reliability evidence for individual score decisions and pass-fail decisions is clearly addressed in the 1999 edition of the Standards for Educational and Psychological Testing (Joint Committee on Standards of American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME], 1999):

–  –  –

Standard 2.1 For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant reliabilities and standard errors of measurement or test information functions should be reported.

Comment: For all scores to be interpreted, users should be supplied with reliability data in enough detail to judge whether scores are precise enough for the user's intended interpretations. (p. 31)

Standard 2.14 Conditional standard errors of measurement should be reported at several score levels if constancy cannot be assumed. Where cutscores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cutscore.

Comment: Estimation of conditional standard errors is usually feasible even with the sample sizes that are typically used for reliability analyses. If it is assumed that the standard error is constant over a broad range of score levels, the rationale for this assumption should be presented. (p. 35)

Standard 2.15 When a test or combination of measures is used to make categorical decisions, estimates should be provided of the percentage of examinees who would be classified in the same way on two applications of the procedure, using the same form or alternate forms of the instrument.

Comment: When a test or composite is used to make categorical decisions, such as pass-fail, the standard error measurement at or near the cutscore has important implications for the trustworthiness of these decisions. (p. 35)

Standard 14.15 Estimates of the reliability of test-based credentialing decisions should be provided.

Comment: The standards for decision reliability described in chapter 2 [2.14 and 2.15 above] are applicable to tests used for licensure and certification. Other types of reliability estimates and associated standard errors of measurement may also be useful, but the reliability of the decision of whether or not to certify is of primary importance. (p. 162)

Standards 2.1 and 2.15 draw the strongest contrast between the types of decisions to be made. Although the K-R20 can provide useful information about the reliability of scores on a teacher licensure test, an index of the consistency of pass-fail decisions is more appropriate, meaningful, and essential evidence of reliability.

These standards indicate that different estimates of reliability and standard errors of measurement are required for these tests. However, the choice of estimation techniques and the minimum acceptable level for any index remain matters of professional judgment.

–  –  –

ness of a decision consistency index for licensure/certification tests. We hope that Standards 2.15 and 14.15 will have a greater impact on practice.

Q6: What determines the type(s) of reliability evidence for teacher licensure/certification tests?

A6: It is the sources of measurement error that can contaminate the scores and, ultimately, the dependability of the decisions based on those scores. There are at least three forms of evidence: (1) decision consistency based on the cutscore, (2) standard error of measurement of the scores near the cutscore, and (3) interjudge reliability of subjectively scored item formats.

Q7: How do you compute the consistency of pass-fail decisions for a teacher licensure/certification test?

–  –  –

which means count the number of pass-pass ($n_{pp}$) and fail-fail ($n_{ff}$) examinees and divide by the total $N$.
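The counting rule just described can be sketched in a few lines. This assumes two administrations (or parallel forms) taken by the same candidates and treats a score at or above the cutscore as a pass; the function name `decision_consistency`, the scores, and the cutscore of 70 are all hypothetical.

```python
def decision_consistency(scores_form1, scores_form2, cutscore):
    """Proportion of examinees classified the same way on both administrations."""
    pairs = list(zip(scores_form1, scores_form2))
    # Pass-pass: at or above the cutscore both times (assumed pass rule)
    n_pp = sum(1 for a, b in pairs if a >= cutscore and b >= cutscore)
    # Fail-fail: below the cutscore both times
    n_ff = sum(1 for a, b in pairs if a < cutscore and b < cutscore)
    return (n_pp + n_ff) / len(pairs)


# Hypothetical scores for 10 candidates on two administrations
form1 = [82, 75, 91, 68, 70, 88, 59, 73, 95, 66]
form2 = [80, 69, 89, 72, 74, 90, 61, 71, 93, 64]
p_o = decision_consistency(form1, form2, cutscore=70)
print(p_o)  # 0.8
```

In this made-up example, six candidates pass both times and two fail both times, so 8 of the 10 decisions are consistent.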

An index published one year later by Swaminathan, Hambleton, and Algina (1974) was the kappa coefficient or, as it was spoken in the Acropolis, κ. This index

–  –  –

or the ratio of the difference between the proportion of classification agreement observed and chance agreement ($p_o - p_c$) to the difference between the maximum proportion of classification agreement, which is 1, and chance agreement ($1 - p_c$).
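The ratio just described can be sketched from a 2 × 2 table of pass-fail classifications. Chance agreement here uses the usual definition from Cohen (1960), the sum of the products of the marginal proportions; the function name `kappa_from_counts` and the counts are hypothetical.

```python
def kappa_from_counts(n_pp, n_pf, n_fp, n_ff):
    """Return (p_o, p_c, kappa) from a 2x2 pass-fail classification table."""
    n = n_pp + n_pf + n_fp + n_ff
    # Observed agreement: same decision on both administrations
    p_o = (n_pp + n_ff) / n
    # Marginal pass rates on the first and second administrations
    pass1 = (n_pp + n_pf) / n
    pass2 = (n_pp + n_fp) / n
    # Chance agreement: both pass by chance plus both fail by chance
    p_c = pass1 * pass2 + (1 - pass1) * (1 - pass2)
    return p_o, p_c, (p_o - p_c) / (1 - p_c)


# Hypothetical counts: 6 pass-pass, 1 pass-fail, 1 fail-pass, 2 fail-fail
p_o, p_c, kappa = kappa_from_counts(6, 1, 1, 2)
print(round(p_o, 2), round(p_c, 2), round(kappa, 3))  # 0.8 0.58 0.524
```

The sketch is undefined when chance agreement equals 1 (for example, when every candidate passes both times), one of the conditions that complicate the interpretation of κ.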

The κ statistic was actually created 14 years earlier by Cohen (1960) as a generalized proportion agreement index, frequently used to estimate interjudge agreement. The statistical properties and extension of κ to weighted κ have been described by Cohen (1968) and Fleiss, Cohen, and Everitt (1969). Swaminathan et al. (1974) just applied κ to estimate the decision consistency of criterion-referenced tests using a pass-fail cutscore.

The debate over the advantages and disadvantages of using $p_o$, κ, or both remains unresolved. The technical issues have been discussed at length (and ad nauseam) by Mister Assessment Person elsewhere (see Berk, 1980, 1984). The bottom line: $p_o$ is an unbiased estimate of decision consistency that is simple to compute, interpret, and explain; κ is a biased estimate with a long list of limitations and statistical conditions that complicate its interpretation. In other words, use $p_o$. It would be check marked as a Best Buy in Consumer Reports.

–  –  –

Q8: What trends in teacher licensure/certification testing over the past 30 years affect the way reliability is conceptualized and estimated?

A8: There have been at least four trends or characteristics related to (1) test structure and item format, (2) setting the passing score, (3) scoring the test, and (4) the feasibility of two test administrations to the same candidates.

Test structure. For nearly two millennia until the 1990s, the tests covered both pedagogy and subject area sections in multiple-choice format. In the 1990s, coterminous with the performance-assessment movement, teacher licensure tests shifted from 100% multiple-choice format toward a combination of multiple-choice and constructed-response (usually a writing sample) or performance-assessment formats.

This trend by both major test publishers—National Evaluation Systems, Inc., and the other one—had a directional impact on test structure: The number of multiple-choice items was reduced to permit time for the writing sample. This would affect the setting of cutscores, test scoring, and reliability, particularly in the next few paragraphs.

Passing score. How do you set a passing score with two different test sections, different item formats, and different numbers of points? I know I am not supposed to raise questions in this answer section, but I got carried away with my excitement over this topic; plus the question seemed more dramatic than simply answering the question. Seriously, there are two basic choices: (1) set a cutscore based on a total score or weighted total score of the combined sections (compensatory scoring) or (2) set a separate cutscore for each section and then set another standard for how many sections must be passed (conjunctive scoring).


The latter is often used when there are several performance-assessment exercises to be passed rather than only multiple-choice and writing sections. Both of these strategies as well as a combination of the two have been used with various licensure/certification tests. The compensatory model has been the most frequent choice for teacher licensure tests.
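The two scoring models can be contrasted with a short sketch. Everything here is hypothetical: the function names, the section weights, the cutscores, and the candidate's scores are illustrative assumptions, not values from any actual licensure test.

```python
def passes_compensatory(section_scores, weights, total_cutscore):
    """One cutscore on the weighted total; strength in one section can offset weakness in another."""
    total = sum(w * s for w, s in zip(weights, section_scores))
    return total >= total_cutscore


def passes_conjunctive(section_scores, section_cutscores, min_sections_passed):
    """A cutscore per section, plus a standard for how many sections must be passed."""
    passed = sum(1 for s, c in zip(section_scores, section_cutscores) if s >= c)
    return passed >= min_sections_passed


# Hypothetical candidate: 48 multiple-choice points, writing sample scored 2 on a 5-0 rubric
candidate = [48, 2]
weights = [1, 5]  # assumed weighting of the writing sample

compensatory_result = passes_compensatory(candidate, weights, total_cutscore=55)
conjunctive_result = passes_conjunctive(candidate, [40, 3], min_sections_passed=2)
print(compensatory_result)  # True: 48 + 5 * 2 = 58, which meets the cutscore of 55
print(conjunctive_result)   # False: the writing score of 2 is below its cutscore of 3
```

The same candidate passes under one model and fails under the other, which is why the choice of scoring model matters for any estimate of decision consistency.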

Test scoring. As if these scoring models and methods for setting cutscores are not complicated enough, another issue is the difference in score ranges for different item formats. Multiple-choice items are right or wrong, scored dichotomously as 1, 0; constructed-response and performance-assessment formats use holistic or analytic scoring rubrics or benchmarks, scored polytomously, such as 5–0. Remember how the K-R20 was restricted to dichotomous items and coefficient alpha eliminated that restriction? "No!" Well, maybe you should go back to questions 2 and 3 and review the answers before passing "Go" and collecting $200.

Anyway, an analogous problem exists in computing a decision-consistency estimate. Both $p_o$ and κ assume dichotomously scored items. Then how do you estimate these indices for a constructed-response item?

Stay tuned to the answer to the next question.
