Wim J. van der Linden Ronald K. Hambleton Eds.

Handbook of Modern Item Response Theory

Springer

Item response theory has become an essential component in the toolkit of every researcher in the behavioral sciences. It provides a powerful means to study individual responses to a variety of stimuli, and the methodology has been extended and developed to cover many different models of interaction. This volume presents a wide-ranging handbook of item response theory and its applications to educational and psychological testing. It will serve both as an introduction to the subject and as a comprehensive reference volume for practitioners and researchers. It is organized into six major sections:

• models for items with polytomous response formats
• models for response time or multiple attempts on items
• models for multiple abilities or cognitive components
• nonparametric models
• models for nonmonotone items
• models with special assumptions about the response process

Each chapter in the book has been written by an expert in that particular topic, and the chapters have been carefully edited to ensure that a uniform style of notation and presentation is used throughout. As a result, all researchers whose work uses item response theory will find this an indispensable companion to their work and it will be the subject's reference volume for many years to come.


ISBN 0-387-94661-6 www.springer-ny.com


Wim J. van der Linden Ronald K. Hambleton Editors

Handbook of Modern Item Response Theory With 57 Illustrations

Springer
New York Berlin Heidelberg Barcelona Hong Kong London Milan Paris Singapore Tokyo

Wim J. van der Linden Faculty of Educational Sciences and Technology University of Twente 7500 AE Enschede Netherlands

Ronald K. Hambleton School of Education University of Massachusetts Amherst, MA 01003 USA

Library of Congress Cataloging-in-Publication Data
Linden, Wim J. van der
Handbook of modern item response theory / Wim J. van der Linden, Ronald K. Hambleton.
p. cm.
Includes bibliographical references and indexes.
ISBN 0-387-94661-6 (hardcover : alk. paper)
1. Item response theory. 2. Psychometrics. 3. Psychology--Mathematical models. 4. Rasch, G. (George), 1901- . I. Hambleton, Ronald K. II. Title.
BF176.V56 1996
150'.28'7-dc20 95-49217

Printed on acid-free paper.

© 1997 Springer-Verlag New York Inc.

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.

Production managed by Terry Kornak; manufacturing supervised by Johanna Tschebull.
Typeset using LaTeX and Springer-Verlag's svsing.sty macro.
Printed and bound by Edwards Brothers Inc., Ann Arbor, MI.
Printed in the United States of America.

9 8 7 6 5 4 3 2

ISBN 0-387-94661-6 Springer-Verlag New York Berlin Heidelberg SPIN 10756629

Preface

The central feature of item response theory (IRT) is the specification of a mathematical function relating the probability of an examinee's response on a test item to an underlying ability. Until the 1980s, interest was mainly in response functions for dichotomously scored items. Nearly all of the research up to that time was focused on the statistical aspects of estimating and testing response functions and making them work in applications to a variety of practical testing problems. Important applications of the models were, for example, to the problems of test-score equating, detection of item bias, test assembly, and adaptive testing. In the 1940s and 1950s, the emphasis was on the normal-ogive response function, but for statistical and practical reasons this function was replaced by the logistic response function in the late 1950s. Today, IRT models using logistic response functions dominate the measurement field.

In practice, however, test items often have nominal, partial-credit, graded, or other polytomous response formats. Also, they may have to deal with multidimensional abilities, measure attitudes rather than abilities, or be designed to diagnose underlying cognitive processes. Owing to the computerization of tests, the recording of response latencies has come within reach too, and models to deal with such data are needed. In the 1980s, psychological and educational researchers therefore shifted their attention to the development of models dealing with such response formats and types of data. The goal of this Handbook is to bring these new developments together in a single volume.

Each chapter in this book contains an up-to-date description of a single IRT model (or family of models) written by the author(s) who originally proposed the model or contributed substantially to its development. Each chapter also provides complete references to the basic literature on the model.
Another feature of the book is that all chapters follow the same fixed format (with only a few minor exceptions). They begin with an Introduction in which the necessity of the model is historically and practically motivated. In the next section, Presentation of the Model, the model is introduced, its mathematical form is described, its parameters are interpreted, and possible generalizations and extensions of it are discussed. The statistical aspects of the model are addressed in the following two sections, Parameter Estimation and Goodness of Fit. In addition to an explanation of the statistical methods available to estimate the parameters and determine the fit of the model to response data, these sections also contain references to existing software developed for the model. The next section, Empirical Example, presents one or more examples based on empirical data. These examples are intended to show how to use the model in a practical application. The final section, Discussion, covers possible remaining aspects of the use of the model. All chapters use a common notation for basic quantities such as item responses, ability and attitude variables, and item parameters.

An introductory chapter to the book summarizes the theory available for the original unidimensional normal-ogive and logistic IRT models for items with dichotomously scored responses. The remaining chapters in the book are presented in six sections:

1. Models for items with polytomous response formats. This section contains models for unordered and ordered discrete polytomous response formats.
2. Models for response time or multiple attempts on items. Models for tests with a time limit in which response time is recorded and models for tests recording the numbers of successes on replicated trials are presented in this section.
3. Models for multiple abilities or cognitive components. Sometimes item responses are produced by more than one ability or by a process with more than one cognitive component. This section of the book presents the models to be used in such cases.
4. Nonparametric models. Though models with a parametric form for response functions have dominated the literature, relaxation of these parametric assumptions can often be helpful and lead to some important results. Nonparametric approaches to response functions are brought together in this section.
5. Models for nonmonotone items. Models for the analysis of responses to statements in attitude instruments with response functions not monotonically increasing in the underlying variable are given in this section.
6. Models with special assumptions about the response process.
The final section contains models for mixtures of response processes or ability distributions, conditional dependence between responses, and response formats designed to allow for partial knowledge of test items.

The book has been designed not only as a comprehensive introduction to modern test theory for practitioners as well as researchers but also as a daily reference for these two groups. In addition, we hope the book may serve as a valuable resource in graduate courses on modern developments in test theory. As its chapters can be read independently, teachers are able to design their own routes through the book to accommodate their course objectives.

The editors appreciate the opportunity to acknowledge the support of the following institutions and persons during the production of this book. American College Testing, Iowa City, Iowa, provided the first editor with generous hospitality during the first stage of the project. The Department of Educational Measurement and Data Analysis at the University of Twente and the Laboratory of Psychometric and Evaluative Research at the University of Massachusetts created the climate which fostered the work on this Handbook. Untiring secretarial support was given by Anita Burchartz-Huls, Colene Feldman, and Peg Louraine. Martin Gilchrist was a helpful source of information at Springer-Verlag throughout the whole project. But most of all, the editors are grateful to the authors of the chapters in this book who believed in the project and were willing to contribute their valuable time.

Wim J. van der Linden, University of Twente
Ronald K. Hambleton, University of Massachusetts at Amherst

Contents

Preface

1. Item Response Theory: Brief History, Common Models, and Extensions
   Wim J. van der Linden and Ronald K. Hambleton

I. Models for Items with Polytomous Response Formats
Introduction
2. The Nominal Categories Model
   R. Darrell Bock
3. A Response Model for Multiple-Choice Items
   David Thissen and Lynne Steinberg
4. The Rating Scale Model
   Erling B. Andersen
5. Graded Response Model
   Fumiko Samejima
6. The Partial Credit Model
   Geofferey N. Masters and Benjamin D. Wright
7. A Steps Model to Analyze Partial Credit
   N.D. Verhelst, C.A.W. Glas, and H.H. de Vries
8. Sequential Models for Ordered Responses
   Gerhard Tutz
9. A Generalized Partial Credit Model
   Eiji Muraki

II. Models for Response Time or Multiple Attempts on Items
Introduction
10. A Logistic Model for Time-Limit Tests
    N.D. Verhelst, H.H.F.M. Verstralen, and M.G.H. Jansen
11. Models for Speed and Time-Limit Tests
    Edward E. Roskam
12. Multiple-Attempt, Single-Item Response Models
    Judith A. Spray

III. Models for Multiple Abilities or Cognitive Components
Introduction
13. Unidimensional Linear Logistic Rasch Models
    Gerhard H. Fischer
14. Response Models with Manifest Predictors
    Aeilko A. Zwinderman
15. Normal-Ogive Multidimensional Model
    Roderick P. McDonald
16. A Linear Logistic Multidimensional Model for Dichotomous Item Response Data
    Mark D. Reckase
17. Loglinear Multidimensional Item Response Model for Polytomously Scored Items
    Henk Kelderman
18. Multicomponent Response Models
    Susan E. Embretson
19. Multidimensional Linear Logistic Models for Change
    Gerhard H. Fischer and Elisabeth Seliger

IV. Nonparametric Models
Introduction
20. Nonparametric Models for Dichotomous Responses
    Robert J. Mokken
21. Nonparametric Models for Polytomous Responses
    Ivo W. Molenaar
22. A Functional Approach to Modeling Test Data
    J.O. Ramsay

V. Models for Nonmonotone Items
Introduction
23. An Hyperbolic Cosine IRT Model for Unfolding Direct Responses of Persons to Items
    David Andrich
24. PARELLA: An IRT Model for Parallelogram Analysis
    Herbert Hoijtink

VI. Models with Special Assumptions About the Response Process
Introduction
25. Multiple Group IRT
    R. Darrell Bock and Michele F. Zimowski
26. Logistic Mixture Models
    Jürgen Rost
27. Models for Locally Dependent Responses: Conjunctive Item Response Theory
    Robert J. Jannarone
28. Mismatch Models for Test Formats that Permit Partial Information To Be Shown
    T.P. Hutchinson

Subject Index

Author Index

Contributors

Erling B. Andersen, Institute of Statistics, University of Copenhagen, DK-1455 Copenhagen, Denmark
David Andrich, Murdoch University, School of Education, Murdoch Western Australia, 6150, Australia
R. Darrell Bock, University of Chicago, Department of Behavioral Sciences, Chicago, IL 60637, USA
Susan E. Embretson, University of Kansas, Department of Psychology, Lawrence, KS 66045, USA
Gerhard H. Fischer, Institut für Psychologie, Universität Wien, 1010 Wien, Austria
H.H. de Vries, SWOKA, 2515JL, The Hague, The Netherlands
C.A.W. Glas, Department of Educational Measurement and Data Analysis, University of Twente, 7500 AE Enschede, Netherlands
Herbert Hoijtink, University of Groningen, Department of Statistics and Measurement Theory, 9712 TS Groningen, Netherlands
T.P. Hutchinson, Department of Mathematical Statistics, University of Sydney, Sydney NSW 2006, Australia
Robert J. Jannarone, Electrical & Computer Engineering Department, University of South Carolina, Columbia, SC 29205, USA
Margo G.H. Jansen, Department of Education, University of Groningen, Grote Rozenstraat 38, 9712 TJ Groningen, Netherlands
Henk Kelderman, Department of Industrial and Organizational Psychology, Free University, 1081 BT Amsterdam, Netherlands
Geofferey N. Masters, The Australian Council for Educational Research, Hawthorne, Victoria 3122, Australia
Roderick P. McDonald, Department of Psychology, University of Illinois, Champaign, IL 61820-6990, USA
Robert J. Mokken, Franklinstraat 16-18, 1171 BL Badhoevedorp, Netherlands
Ivo W. Molenaar, University of Groningen, Department of Statistics and Measurement Theory, 9712 TS Groningen, Netherlands
Eiji Muraki, Educational Testing Service, Princeton, NJ 08541, USA
J.O. Ramsay, Department of Psychology, 1205 Dr. Penfield Avenue, Montreal H3A 1B1 PQ, Canada
Mark D. Reckase, American College Testing, Iowa City, IA 52243, USA
Edward E. Roskam, NICI, Psychologisch Laboratorium, 6500 HE Nijmegen, Netherlands
Jürgen Rost, IPN, D-2300 Kiel 7, Germany
Fumiko Samejima, Psychology Department, University of Tennessee, Knoxville, TN 27996, USA
Elisabeth Seliger, Department of Psychology, University of Vienna, A-1010, Vienna, Austria
Judith A. Spray, American College Testing, Iowa City, IA 52243, USA
Lynne Steinberg, Department of Psychology, Syracuse University, Syracuse, NY 13244-2340, USA
David Thissen, Department of Psychology, University of North Carolina, Chapel Hill, NC 27599-3270, USA
Gerhard Tutz, Technische Universität Berlin, Institut für Quantitative Methoden, 10587 Berlin, Germany
N.D. Verhelst, Cito, 6801 MG Arnhem, Netherlands
Huub Verstralen, Cito, 6801 MG Arnhem, Netherlands
Benjamin D. Wright, Department of Education, University of Chicago, Chicago, IL 60637, USA
Michele F. Zimowski, Department of Behavioral Sciences, University of Chicago, Chicago, IL 60637, USA
Aeilko H. Zwinderman, Department of Medical Statistics, University of Leiden, 2300 RA Leiden, Netherlands

1 Item Response Theory: Brief History, Common Models, and Extensions

Wim J. van der Linden and Ronald K. Hambleton

Long experience with measurement instruments such as thermometers, yardsticks, and speedometers may have left the impression that measurement instruments are physical devices providing measurements that can be read directly off a numerical scale. This impression is certainly not valid for educational and psychological tests. A useful way to view a test is as a series of small experiments in which the tester records a vector of responses by the testee. These responses are not direct measurements, but provide the data from which measurements can be inferred.

Just as in any experiment, the methodological problem of controlling responses for experimental error holds for tests. Experimental error with tests arises from the fact that not only the independent variables but also other variables operational in the experiment, generally known as background, nuisance, or error variables, may have an impact on the responses concerned. Unless adequate control is provided for such variables, valid inferences cannot be made from the experiment.

Literature on experimental design has shown three basic ways of coping with experimental error (Cox, 1958): (1) matching or standardization; (2) randomization; and (3) statistical adjustment. If conditions in an experiment are matched, subjects operate under the same levels of error or nuisance variables, and effects of these variables cannot explain any differences in experimental outcomes. Although matching is a powerful technique, it does have the disadvantage of restricted generalizability. Experimental results hold only conditionally on the levels of the matching variables that were in force.

Randomization is based on the principle that if error variables cannot be manipulated to create the same effects, random selection of conditions guarantees that these effects can be expected to be the same on average.


Thus, these effects cannot have a systematic influence on the responses of the subjects.

Statistical adjustment, on the other hand, is a technique for post hoc control, that is, it does not assume any intervention from the experimenter. The technique can be used when matching or randomization is impossible, but it does require that the levels of the relevant error variables have been measured during the experiment. The measurements are used to adjust the observed values of the dependent variables for the effects of the error variables.

Adjustment procedures are model-based in the sense that quantitative theory is needed to provide the mathematical equations relating error variables to the dependent variables. In a mature science such as physics, substantive theory may be available to provide these equations. For example, if temperature or air pressure are known to be disturbing variables that vary across experimental conditions and well-confirmed theory is available to predict how these variables affect the dependent variable, then experimental findings can be adjusted to common levels of temperature or pressure afterwards. If no substantive theory is available, however, it may be possible to postulate plausible equations and to estimate their parameters from data on the error and dependent variables. If the estimated equations fit the data, they can be used to provide the adjustment. This direction for analysis is popular in the behavioral sciences, in particular in the form of analysis of covariance (ANCOVA), which postulates linear (regression) equations between error and dependent variables.
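As a minimal sketch of the adjustment idea described above, the following Python fragment simulates two experimental conditions that differ only in the level of a nuisance variable (temperature), estimates a pooled within-condition regression slope in the spirit of ANCOVA, and adjusts the condition means to a common temperature. The simulated data, the linear form, the effect size, and all function names are illustrative assumptions and are not taken from the text.

```python
import random
from statistics import fmean

def pooled_slope(groups):
    """Pooled within-group least-squares slope of outcome on covariate."""
    num = den = 0.0
    for covariate, outcome in groups:
        mc, mo = fmean(covariate), fmean(outcome)
        num += sum((c - mc) * (o - mo) for c, o in zip(covariate, outcome))
        den += sum((c - mc) ** 2 for c in covariate)
    return num / den

def adjusted_mean(covariate, outcome, slope, reference):
    """Group mean shifted to the value expected at the reference covariate level."""
    return fmean(outcome) - slope * (fmean(covariate) - reference)

rng = random.Random(0)
REFERENCE_TEMP = 20.0   # assumed common level to which results are adjusted
TRUE_SLOPE = 0.5        # assumed effect of one degree on the outcome

def simulate(mean_temp, n=500):
    """Hypothetical condition: same underlying outcome, different temperature."""
    temps = [rng.gauss(mean_temp, 1.0) for _ in range(n)]
    outcomes = [10.0 + TRUE_SLOPE * (t - REFERENCE_TEMP) + rng.gauss(0.0, 0.3)
                for t in temps]
    return temps, outcomes

temp_a, y_a = simulate(18.0)
temp_b, y_b = simulate(24.0)

slope = pooled_slope([(temp_a, y_a), (temp_b, y_b)])
raw_difference = fmean(y_b) - fmean(y_a)
adjusted_difference = (adjusted_mean(temp_b, y_b, slope, REFERENCE_TEMP)
                       - adjusted_mean(temp_a, y_a, slope, REFERENCE_TEMP))

print(f"estimated slope:     {slope:.3f}")
print(f"raw difference:      {raw_difference:.3f}")
print(f"adjusted difference: {adjusted_difference:.3f}")
```

In this simulation the raw difference between conditions is produced entirely by the temperature difference, so the adjusted difference should be close to zero. The adjustment is only as good as the postulated linear equation, which is the point made in the text about checking whether the estimated equations fit the data.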

Classical Test Theory

Classical test theory fits in with the tradition of experimental control through matching and randomization (van der Linden, 1986). The theory starts from the assumption that systematic effects between responses of examinees are due only to variation in the ability (i.e., true score) of interest. All other potential sources of variation existing in the testing materials, external conditions, or internal to the examinees are assumed either to be constant through rigorous standardization or to have an effect that is nonsystematic or "random by nature."

The classical test theory model, which was put forward by Spearman (1904) but received its final axiomatic form in Novick (1966), decomposes the observed test score into a true score and an error score. Let $X_{jg}$ be the observed-score variable for examinee $j$ on test $g$, which is assumed to be random across replications of the experiment. The classical model for a fixed examinee $j$ postulates that

$$X_{jg} = \tau_{jg} + E_{jg}, \tag{1}$$

where $\tau_{jg}$ is the true score defined as the expected observed score, $\mathcal{E}X_{jg}$, and $E_{jg} = X_{jg} - \tau_{jg}$ is the error score. Note that, owing to the definition of the true and error scores, the model does not impose any constraints on $X_{jg}$ and is thus always "true." Also, observe that $\tau_{jg}$ is defined to cover all systematic information in the experiment, and that it is only a parameter of interest if experimental control of all error variables has been successful. Finally, it is worthwhile to realize that, as in any application of the matching principle, $\tau_{jg}$ has meaning only conditionally on the chosen levels of the standardized error variables. The true score is fully determined by the test as designed, not by some Platonic state inside the examinee that exists independent of the test (Lord and Novick, 1968, Sec. 2.8).

If random selection of examinees from a population is a valid assumption, the true score parameter, $\tau_{jg}$, has to be replaced by a random true score variable, $T_{j*}$. The model in Eq. (1) becomes:

$$X_{j*} = T_{j*} + E_{j*} \tag{2}$$

(Lord and Novick, 1968, Sec. 2.6). It is useful to note that Eq. (2) is also the linear equation of the one-way analysis of variance (ANOVA) model with a random factor. This equivalence points again to the fact that educational and psychological tests can be viewed as regular standardized experiments.
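The decomposition for a fixed examinee can be illustrated with a small simulation. The sketch below assumes a hypothetical six-item test on which the examinee has a fixed success probability per item and on which item responses are independent within a replication; neither assumption comes from the text. The true score is computed as the expected number-correct score, and the error score is whatever remains, so its average over replications is zero by construction.

```python
import random
from statistics import fmean

def observed_score(success_probs, rng):
    """One replication of the test: number-correct score X for a fixed examinee."""
    return sum(1 for p in success_probs if rng.random() < p)

rng = random.Random(1)

# Hypothetical success probabilities of one examinee on the six items of a test.
success_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

# True score: the expected observed score over replications of the test.
tau = sum(success_probs)

# Replicate the "experiment" many times and form the error scores E = X - tau.
scores = [observed_score(success_probs, rng) for _ in range(20000)]
errors = [x - tau for x in scores]

mean_score = fmean(scores)
mean_error = fmean(errors)

print(f"true score tau:      {tau:.2f}")
print(f"mean observed score: {mean_score:.3f}")
print(f"mean error score:    {mean_error:.3f}")
```

Changing the list of success probabilities, that is, changing the test, changes the true score even though the examinee is the same, which is the sense in which the true score is determined by the test as designed.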

Test-Dependent True Scores

As mentioned earlier, a serious disadvantage of experiments in which matching is used for experimental control is reduction of external validity. Statistical inferences from the data produced by the experiment cannot be generalized beyond the standardized levels of its error or nuisance variables. The same principle holds for standardized tests. True scores on two different tests designed to measure the same ability variable, even if they involve the same standardization of external and internal conditions, are generally unequal. The reason is that each test entails its own set of items and that each item has different properties. From a measurement point of view, such properties of items are nuisance or error variables that escape standardization.

The practical consequences of this problem, which has been documented numerous times in the test-theoretic literature [for an early reference, see Loevinger (1947)], are formidable. For example, it is impossible to use different versions of a test in a longitudinal study or on separate administrations without confounding differences between test scores by differences between the properties of the tests.

Statistical Adjustment for Differences Between Test Items

Perhaps the best way to solve the problem of test-dependent scores is through the third method of experimental control listed previously: statistical adjustment. The method of statistical adjustment requires explicit parametrization of the ability of interest as well as the properties of the items, via a model that relates their values to response data collected through the test. If the model holds and the item parameters are known, the model adjusts the data for the properties of items in the test, and, therefore, can be used to produce ability measures that are free of the properties of the items in the test. This idea of statistical adjustment of ability measures for nuisance properties of the items is exactly analogous to the way analysis of covariance (ANCOVA) was introduced to parametrize and subsequently "remove" the effects of covariates in an ANOVA study.
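The following sketch illustrates this adjustment under stated assumptions. It uses a two-parameter normal-ogive response function (reviewed later in this chapter) with item parameters treated as known, simulates one examinee on a hypothetical easy test and a hypothetical hard test, and estimates ability by a simple grid search over the likelihood. The test lengths, parameter values, grid, and function names are all illustrative choices rather than anything prescribed by the text.

```python
import math
import random

def irf(theta, a, b):
    """Assumed normal-ogive response function: standard normal cdf at a*(theta - b)."""
    return 0.5 * math.erfc(-a * (theta - b) / math.sqrt(2.0))

def simulate_responses(theta, items, rng):
    """Draw one 0/1 response per item for an examinee with ability theta."""
    return [1 if rng.random() < irf(theta, a, b) else 0 for a, b in items]

def log_likelihood(theta, items, responses):
    """Log-likelihood of theta, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for (a, b), u in zip(items, responses):
        p = min(max(irf(theta, a, b), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def ml_ability(items, responses):
    """Grid-search ML estimate of theta with item parameters treated as known."""
    grid = [-4.0 + 0.02 * k for k in range(401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

def spread(lo, hi, n):
    """n equally spaced difficulty values from lo to hi."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]

rng = random.Random(2)
TRUE_THETA = 0.5
easy_test = [(1.0, b) for b in spread(-2.5, 0.5, 400)]  # hypothetical easy items
hard_test = [(1.0, b) for b in spread(0.0, 3.0, 400)]   # hypothetical hard items

easy_responses = simulate_responses(TRUE_THETA, easy_test, rng)
hard_responses = simulate_responses(TRUE_THETA, hard_test, rng)

prop_easy = sum(easy_responses) / len(easy_responses)
prop_hard = sum(hard_responses) / len(hard_responses)
theta_easy = ml_ability(easy_test, easy_responses)
theta_hard = ml_ability(hard_test, hard_responses)

print(f"proportion correct: easy {prop_easy:.2f}, hard {prop_hard:.2f}")
print(f"ability estimates:  easy {theta_easy:.2f}, hard {theta_hard:.2f}")
```

The proportions correct differ sharply between the two tests, whereas the two ability estimates land on the same scale and differ only by estimation error. Very long tests are used here only to keep that error small in a single simulated run.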

Item Response Theory

Mathematical models to make statistical adjustments in test scores have been developed in item response theory (IRT). The well-known IRT models for dichotomous responses, for instance, adjust response data for such properties of test items as their difficulty, discriminating power, or liability to guessing. These models will be reviewed in this chapter. The emphasis in this volume, however, is not on IRT models for handling dichotomously scored data, but on the various extensions and refinements of these original models that have emerged since the publication of Lord and Novick's Statistical Theories of Mental Test Scores (1968). The new models have been designed for response data obtained under less rigorous standardization of conditions or admitting open-ended response formats, implying that test scores have to be adjusted for a larger array of nuisance variables. For example, if, in addition to the ability of interest, a nuisance ability has an effect on responses to test items, a second ability parameter may be added to the model to account for its effects. Several examples of such multidimensional IRT models are presented in this volume. Likewise, if responses are made to the individual answer choices of an item (e.g., a multiple-choice item), they clearly depend on the properties of the answer choices, which have to be parametrized in addition to other properties of the item. A variety of models for this case, generally known as polytomous IRT models, is also presented in this volume.

The methodological principle underlying all of the models included in this volume is the simple principle of a separate parameter for each factor with a separate effect on the item responses. However, it belongs to the "art of modeling" to design a mathematical structure that reflects the interaction between these factors and at the same time makes the model statistically tractable.

Before introducing the models in a formal way, a brief review of common dichotomous IRT models is given in this chapter. These were the first IRT models to be developed, and they have been widely researched and extensively used to solve many practical measurement problems (Hambleton and Swaminathan, 1985; Lord, 1980).


Review of Dichotomous IRT Models

As its name suggests, IRT models test behavior not at the level of (arbitrarily defined) test scores, but at the item level. The first item format addressed in the history of IRT was the dichotomous format in which responses are scored either as correct or incorrect. If the response by examinee $j$ to item $i$ is denoted by a random variable $U_{ij}$, it is convenient to code the two possible scores as $U_{ij} = 1$ (correct) and $U_{ij} = 0$ (incorrect). To model the distribution of this variable, or, equivalently, the probability of a correct response, the ability of the examinee is represented by a parameter $\theta \in (-\infty, +\infty)$, and it is assumed in a two-parameter model that the properties of item $i$ that have an effect on the probability of a success are its "difficulty" and "discriminating power," which are represented by parameters $b_i \in (-\infty, +\infty)$ and $a_i \in (0, +\infty)$, respectively. These parameters are simply symbols at this point; their meaning can only be established if the model is further specified.

Since the ability parameter is the structural parameter that is of interest and the item parameters are considered nuisance parameters, the probability of success on item $i$ is usually presented as $P_i(\theta)$, that is, as a function of $\theta$ specific to item $i$. This function is known as the item response function (IRF). Previous names for the same function include item characteristic curve (ICC), introduced by Tucker (1946), and trace line, introduced by Lazarsfeld (1950). Owing to restrictions in the range of possible probability values, item response functions cannot be linear in $\theta$. Obviously, the function has to be monotonically increasing in $\theta$. The need for a monotonically increasing function with a lower and upper asymptote at 0 and 1, respectively, suggests a choice from the class of cumulative distribution functions (cdf's). This choice was the one immediately made when IRT originated.

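Normal-Ogive Model

The first IRT model was the normal-ogive model, which postulated a normal cdf as a response function for the item:

$$P_i(\theta) = \int_{-\infty}^{a_i(\theta - b_i)} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz. \tag{3}$$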
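A short numerical sketch may help in reading the normal-ogive response function. The code below evaluates the standard normal cdf at $a_i(\theta - b_i)$ using the complementary error function from Python's standard library, and checks two properties discussed in the text that follows: the probability of success at $\theta = b_i$ is 0.50, and the slope at that point is proportional to $a_i$ (the constant is $1/\sqrt{2\pi}$). The item parameter values and function names are hypothetical.

```python
import math

def normal_ogive(theta, a, b):
    """Two-parameter normal-ogive response function: Phi(a * (theta - b))."""
    return 0.5 * math.erfc(-a * (theta - b) / math.sqrt(2.0))

def slope_at(theta, a, b, h=1e-6):
    """Central-difference approximation to the derivative of the response function."""
    return (normal_ogive(theta + h, a, b) - normal_ogive(theta - h, a, b)) / (2.0 * h)

# Hypothetical items given as (a_i, b_i) pairs.
items = [(0.5, -1.0), (1.0, 0.0), (2.0, 1.0)]

for a, b in items:
    p_at_b = normal_ogive(b, a, b)
    slope = slope_at(b, a, b)
    expected_slope = a / math.sqrt(2.0 * math.pi)
    print(f"a={a:.1f} b={b:+.1f}  P(b)={p_at_b:.2f}  "
          f"slope={slope:.4f}  a/sqrt(2*pi)={expected_slope:.4f}")
```

Evaluating `normal_ogive` over a grid of ability values for each item gives curves of the kind shown in Figure 1.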

For a few possible values of the parameters $b_i$ and $a_i$, Figure 1 displays the shape of a set of normal-ogive response functions. From straightforward analysis of Eq. (3) it is clear that, as demonstrated in Figure 1, the difficulty parameter $b_i$ is the point on the ability scale where an examinee has a probability of success on the item of 0.50, whereas the value of $a_i$ is proportional to the slope of the tangent to the response function at this point. Both formal properties of the model justify the interpretation suggested by the names of the parameters. If $b_i$ increases in value, the response function moves to the right and a higher ability is needed to produce the same probability of success on the item. Also, the larger the value of $a_i$, the better the item discriminates between the probabilities of success of examinees with abilities below and above $\theta = b_i$.

FIGURE 1. Two-parameter normal-ogive response functions (probability of success plotted against ability).

Although credit for the normal-ogive model is sometimes given to Lawley (1943), the same basic model had been studied earlier by Ferguson (1942), Mosier (1940, 1941), and Richardson (1936). In fact, the original idea for the model can be traced to Thurstone's use of the normal model in his discriminal dispersions theory of stimulus perception (Thurstone, 1927b). The impact of the psychophysical scaling literature on the early writings of these authors is obvious from their terminology. Researchers in psychophysics study the relation between the physical properties of stimuli and their perception by human subjects. Their main method is to present a stimulus of varying strength, for example, a tone of varying loudness, where the subject's task is to report whether or not he/she was able to detect the stimulus. Since the probability of detection is clearly an increasing function of the physical strength of the stimulus, the normal cdf with the parametrization in Eq. (3) was used as response function. However, in this model, $\theta$ stands for the known strength of a stimulus and the interest was only in the parameters $a_i$ and $b_i$, the latter being known as the limen of the stimulus, a name also used for the difficulty of the item in the early days of IRT. Mosier (1940, 1941) was quite explicit in his description of the parallels between psychophysics and psychometric modeling. In his 1940 publication he provides a table with the different names used in the two disciplines, which in fact describe common quantities.

Early descriptions of the normal-ogive model as well as of the procedures for its statistical analysis were not always clear. For example, ability $\theta$ was not always considered as a latent variable with values to be estimated from the data but taken to be the observed test score. Equation (3) was used as a model for the (nonlinear) regression of item on test scores. Richardson, on the other hand, used the normal-ogive model to scale a dichotomous external success criterion on the observed-score scale of several test forms differing in difficulty. In the same spirit that originated with Binet and Simon (1905), and was reinforced by the widespread use of Thurstone's method of absolute scaling (Thurstone, 1925, 1927a), the ability variable was always assumed to be normally distributed in the population of examinees. The belief in normality was so strong that even if an observed ability distribution did not follow a normal distribution, it was normalized to find the "true" scale on which the model in Eq. (3) was assumed to hold. Also, in accordance with prevailing practice, the ability scale was divided into a set of (typically seven) intervals defined by equal standard-deviation units. The midpoints of the intervals were the discrete ability scores actually used to fit a normal-ogive model.

The first coherent treatment of the normal-ogive model that did not suffer from the above idiosyncrasies was given in Lord (1952). His treatment also included psychometric theory for the bivariate distribution of item and ability scores, the limiting frequency distributions of observed scores on large tests, and the bivariate distribution of observed scores on two tests measuring the same ability variable. All these distributions were derived with the help of the normal-ogive model.

Parameter Estimation. In the 1940s and 1950s, computers were not available and parameter estimation was a laborious job. The main estimation method was borrowed from psychophysics and known as the constant process.
The method consisted of fitting a weighted regression line through the data points and the empirical probits. [The latter were defined as the inverse transformation of Eq. (3) for empirical proportions of successes on the item.] The weights used in the regression analysis were known as the Müller-Urban weights. Lawley (1943) derived maximum-likelihood (ML) estimators for the item parameters in the normal-ogive model and showed that these were identical to the constant-process estimators upon substitution of empirical probits in the Müller-Urban weights. Lord and Novick (1968, Secs. 16.8-16.10) presented the following set of equations, derived under the assumption of a normal ability distribution, that relate the parameters $b_i$ and $a_i$ to the classical item $\pi$-value and biserial item-test correlation, $\rho_i$:

$$b_i = -\frac{\gamma_i}{\rho_i}, \tag{4}$$

$$a_i = \frac{\rho_i}{\sqrt{1 - \rho_i^2}}, \tag{5}$$


where $\gamma_i$ is defined by

$$\gamma_i = \Phi^{-1}(\pi_i) \tag{6}$$

and $\Phi(\cdot)$ is the standard normal distribution function. Although the equations were based on the assumption of the normal-ogive model, plug-in estimates based on these equations have long served as heuristic estimates or as starting values for the ML estimators of the parameters in the logistic model (Urry, 1974).

Goodness of Fit. Formal goodness-of-fit tests were never developed for the normal-ogive model. In the first analyses published, the basic method of checking the model was graphical inspection of plots of predicted and empirical item-test regression functions [see, for example, Ferguson (1942) and Richardson (1936)]. Lord (1952) extended this method to plots of predicted and empirical test-score distributions and used a chi-square test to study the differences between expected and actual item performance at various intervals along the ability continuum.
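As a minimal sketch of how the plug-in values mentioned above might be computed, the Python fragment below assumes the relations $a_i = \rho_i/\sqrt{1-\rho_i^2}$, $b_i = -\gamma_i/\rho_i$, and $\gamma_i = \Phi^{-1}(\pi_i)$ of Eqs. (4)-(6). The function name `plug_in_item_parameters` and its argument names are illustrative only and do not come from any of the cited sources.

```python
from math import sqrt
from statistics import NormalDist


def plug_in_item_parameters(pi_value, biserial):
    """Heuristic (a_i, b_i) from a classical item pi-value and a biserial correlation.

    Assumes a normal ability distribution and the relations
    a = rho / sqrt(1 - rho^2), b = -gamma / rho, gamma = Phi^{-1}(pi).
    """
    if not 0.0 < pi_value < 1.0:
        raise ValueError("pi_value must lie strictly between 0 and 1")
    if biserial == 0.0 or abs(biserial) >= 1.0:
        raise ValueError("biserial must be nonzero and smaller than 1 in absolute value")
    gamma = NormalDist().inv_cdf(pi_value)       # inverse standard normal cdf
    a = biserial / sqrt(1.0 - biserial ** 2)     # discrimination parameter
    b = -gamma / biserial                        # difficulty parameter
    return a, b
```

For an item answered correctly by half of the examinees ($\pi_i = 0.5$) with a biserial correlation of 0.6, this sketch gives a discrimination of about 0.75 and a difficulty of 0. Such values would serve only as starting values, not as final estimates.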

Rasch or One-Parameter Logistic Model

Rasch began his work in educational and psychological measurement in the late 1940s. In the 1950s he developed two Poisson models for reading tests and a model for intelligence and achievement tests. He called the latter "a structural model for items in a test," but it is now generally known as the Rasch model. Formally, the model is a special case of the Birnbaum model to be discussed below. However, because it has unique properties among all known IRT models, it deserves a separate introduction. A full account of the three models is given in Rasch (1960).

Rasch's main motivation for his models was his desire to eliminate references to populations of examinees in analyses of tests (Rasch, 1960, Preface; Chap. 1). Test analysis would only be worthwhile if it were individual-centered, with separate parameters for the items and the examinees. To make his point, Rasch often referred to the work of Skinner, who was also known for his dislike of the use of population-based statistics and always experimented with individual cases. Rasch's point of view marked the transition from population-based classical test theory, with its emphasis on standardization and randomization, to IRT with its probabilistic modeling of the interaction between an individual item and an individual examinee. As will be shown, the existence of sufficient statistics for the item parameters in the Rasch model can be used statistically to adjust ability estimates for the presence of nuisance properties of the items in a special way.

Poisson Model. Only the model for misreadings will be discussed here. The model is based on the assumption of a Poisson distribution for the number of reading errors in a text. This assumption is justified if the reading


process is stationary and does not change due to, for example, the examinee becoming tired or fluctuations in the difficulties of the words in the text. If the text consists of $T$ words and $X$ is the random variable denoting the number of words misread, then the Poisson distribution assumes that

$$P(X = x \mid T) = e^{-\lambda} \frac{\lambda^x}{x!}, \tag{7}$$

where parameter $\lambda$ is the expected number of misreadings. Equivalently, $\xi = \lambda/T$ is the probability of misreading a single word sampled from the text. The basic approach followed by Rasch was to further model the basic parameter $\xi$ as a function of parameters describing the ability of the examinee and the difficulty of the text. If $\theta_j$ is taken to represent the reading ability of examinee $j$ and $\delta_t$ represents the difficulty of text $t$, then $\xi_{jt}$ can be expected to be a parameter decreasing in $\theta_j$ and increasing in $\delta_t$. The simple model proposed by Rasch was

$$\xi_{jt} = \frac{\delta_t}{\theta_j}. \tag{8}$$
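The model in Eqs. (7) and (8) can be illustrated with a short Python sketch. It assumes the Poisson form $P(X = x \mid T) = e^{-\lambda}\lambda^x/x!$ with $\lambda = T\xi_{jt}$ and $\xi_{jt} = \delta_t/\theta_j$. The function names and the parameter values in the usage note are hypothetical and chosen only for illustration.

```python
from math import exp, factorial


def misreading_rate(theta_j, delta_t):
    """Probability xi_jt of misreading a single word: text difficulty over ability."""
    xi = delta_t / theta_j
    if not 0.0 < xi < 1.0:
        raise ValueError("delta_t / theta_j must be a probability between 0 and 1")
    return xi


def poisson_pmf(x, lam):
    """Poisson probability of x events when the expected number is lam."""
    return exp(-lam) * lam ** x / factorial(x)


def misreading_pmf(x, n_words, theta_j, delta_t):
    """Probability that examinee j misreads x words in a text of n_words words."""
    lam = n_words * misreading_rate(theta_j, delta_t)  # expected number of misreadings
    return poisson_pmf(x, lam)
```

With a hypothetical ability of 4.0 and text difficulty of 0.2, the misreading probability per word is 0.05, so a text of 100 words has an expected 5 misreadings and `misreading_pmf(0, 100, 4.0, 0.2)` equals $e^{-5}$.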

If the model in Eqs. (7) and (8) is applied to a series of examinees $j = 1, \ldots, N$ reading a series of texts $t = 1, \ldots, T$, it holds that the sum of reading errors in text $t$ across examinees, $X_{t\cdot}$, is a sufficient statistic for -

(14)

The motivation for this substitution was the result of Haley (1952) that for a logistic cdf with scale factor 1.7, $L(1.7x)$, and a normal cdf, $N(x)$:

$$|N(x) - L(1.7x)| < 0.01 \quad \text{for } x \in (-\infty, \infty)$$

[for a slight improvement on this result, see Molenaar (1974)]. At the same time, the logistic function is much more tractable than the normal-ogive function, while the parameters $b_i$ and $a_i$ retain their graphical interpretation shown in Figure 1.

Birnbaum also proposed a third parameter for inclusion in the model to account for the nonzero performance of low-ability examinees on multiple-choice items. This nonzero performance is due to the probability of guessing correct answers to multiple-choice items. The model then takes the form

$$P_i(\theta) = c_i + (1 - c_i) \frac{1}{1 + \exp\{-a_i(\theta - b_i)\}}. \tag{15}$$

Equation (15) follows immediately from the assumption that the examinee either knows the correct response with a probability described by Eq. (14) or guesses with a probability of success equal to the value of $c_i$. From Figure 3, it is clear that the parameter $c_i$ is the height of the lower asymptote of the response function. In spite of the fact that Eq. (15) no longer defines a logistic function, the model is known as the three-parameter logistic (or 3-PL) model, while the model in Eq. (14) is called the two-parameter logistic (or 2-PL) model. The c-parameter is sometimes referred to as the "guessing parameter," since its function is to account for the test item performance of low-ability examinees.

FIGURE 3. Three-parameter logistic response function.

One of Birnbaum's other contributions to psychometric theory was his introduction of Fisher's measure to describe the information structure of a test. For a dichotomous IRT model, Fisher's measure for the information on the unknown ability $\theta$ in a test of $n$ items can be written as

$$I(\theta) = \sum_{i=1}^{n} I_i(\theta) = \sum_{i=1}^{n} \frac{[P_i'(\theta)]^2}{P_i(\theta)[1 - P_i(\theta)]}, \tag{16}$$

where $I_i(\theta)$ is the information on $\theta$ in the response on item $i$ and $P_i'(\theta) = \frac{\partial}{\partial\theta} P_i(\theta)$. One of his suggestions for the use of these information measures was to influence test construction by first setting up a target for the information function of the test and then assembling the test to meet this target using the additivity in Eq. (16). Birnbaum also derived weights for optimally scoring a test in a two-point classification problem and compared the efficiency of several other scoring formulas for the ability parameter in the model. His strongest statistical contribution no doubt was his application of the theory of maximum likelihood to the estimation of both the ability and the item parameters in the logistic models, which will now be addressed.

Parameter Estimation. Continuing the same notation as before, the likelihood function for the case of $N$ examinees and $n$ items associated with the 2-PL model can be written as

$$L(\theta, a, b; u) = \prod_{j=1}^{N} \prod_{i=1}^{n} P_i(\theta_j; a_i, b_i)^{u_{ij}} [1 - P_i(\theta_j; a_i, b_i)]^{1 - u_{ij}}. \tag{17}$$

Maximizing the logarithm of the likelihood function results in the following set of estimation equations:

$$\frac{\partial \ln L}{\partial \theta_j} = 0, \quad j = 1, \ldots, N; \qquad \frac{\partial \ln L}{\partial a_i} = 0, \quad \frac{\partial \ln L}{\partial b_i} = 0, \quad i = 1, \ldots, n. \tag{18}$$
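The quantities discussed here can be made concrete with a short Python sketch. It assumes the 3-PL response function $P_i(\theta) = c_i + (1 - c_i)/[1 + \exp\{-a_i(\theta - b_i)\}]$, the item information $[P_i'(\theta)]^2 / \{P_i(\theta)[1 - P_i(\theta)]\}$ summed over items, and a 2-PL log-likelihood built from independent responses. All function names are illustrative and not taken from Birnbaum's work.

```python
from math import exp, log


def p_3pl(theta, a, b, c=0.0):
    """3-PL response function; with c = 0 it reduces to the 2-PL model."""
    logistic = 1.0 / (1.0 + exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


def item_information(theta, a, b, c=0.0):
    """Fisher information on theta in the response to one item."""
    logistic = 1.0 / (1.0 + exp(-a * (theta - b)))
    p = c + (1.0 - c) * logistic
    dp = (1.0 - c) * a * logistic * (1.0 - logistic)  # derivative of P with respect to theta
    return dp * dp / (p * (1.0 - p))


def total_information(theta, items):
    """Information of a test: the sum over items given as (a, b, c) triples."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def log_likelihood_2pl(thetas, items, responses):
    """Log-likelihood for 0/1 responses; items are (a, b) pairs, one response row per examinee."""
    total = 0.0
    for theta, row in zip(thetas, responses):
        for (a, b), u in zip(items, row):
            p = p_3pl(theta, a, b)
            total += u * log(p) + (1 - u) * log(1.0 - p)
    return total
```

Because the test information is a plain sum, an item can be added to or removed from `items` and its contribution at any ability level read off directly, which is the property that makes target-based test assembly possible.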

The technique suggested by Birnbaum was jointly to solve the equations for the values of the unknown parameters in an iterative scheme, which starts with initial values for the ability parameters, solves the equations for the item parameters, fixes the item parameters, and solves the equations for improved estimates of the values of the ability parameters, etc. The solutions obtained by this procedure are known as the joint maximum likelihood (JML) estimates, although the name alternating maximum likelihood (AML) would seem to offer a better description of the procedure.

Problems of convergence and a somewhat unknown statistical status of the JML estimators have led to the popularity of MML estimation for the 2-PL and 3-PL models. The method was introduced by Bock and Lieberman (1970) for the normal-ogive model but immediately adopted as a routine for the logistic model. For a density function $f(\theta)$ describing the ability distribution, the marginal probability of obtaining a response vector $u = (u_1, \ldots, u_n)$ on the items in the test is equal to

$$P(u \mid a, b) = \int_{-\infty}^{\infty} \prod_{i=1}^{n} P_i(\theta; a_i, b_i)^{u_i} [1 - P_i(\theta; a_i, b_i)]^{1 - u_i} f(\theta)\, d\theta. \tag{19}$$
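The marginal probability in Eq. (19) can be approximated numerically. The sketch below is only an illustration under stated assumptions: it takes $f(\theta)$ to be the standard normal density, uses the 2-PL response function, and replaces the integral by a trapezoidal rule on the interval $[-6, 6]$. This is a simple stand-in chosen for readability, not the quadrature scheme of Bock and Lieberman, and all names are hypothetical.

```python
from math import exp, pi, sqrt


def p_2pl(theta, a, b):
    """2-PL response function."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def standard_normal_pdf(theta):
    """Assumed ability density f(theta)."""
    return exp(-0.5 * theta * theta) / sqrt(2.0 * pi)


def marginal_probability(u, items, density=standard_normal_pdf,
                         lo=-6.0, hi=6.0, n_points=241):
    """Trapezoidal approximation to the marginal probability of response vector u.

    u is a sequence of 0/1 responses and items a sequence of (a, b) pairs.
    """
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        theta = lo + k * step
        conditional = 1.0  # probability of u given theta under local independence
        for (a, b), u_i in zip(items, u):
            p = p_2pl(theta, a, b)
            conditional *= p if u_i == 1 else 1.0 - p
        weight = 0.5 if k in (0, n_points - 1) else 1.0  # trapezoid end points
        total += weight * conditional * density(theta)
    return total * step
```

A useful sanity check on such a routine is that the marginal probabilities of all $2^n$ response patterns should sum to approximately one.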

The marginal likelihood function associated with the full data matrix $u = (u_{ij})$ is given by the multinomial kernel

$L(a, b; u) = \prod^{2^n}$

0, or reparametrize the model by setting, say, $\pi_A = e^{z_A}$ and $\pi_B = e^{z_B}$. Then,

$$P_A = \frac{e^{z_A}}{e^{z_A} + e^{z_B}} = \frac{1}{1 + e^{-z}}, \tag{1}$$

and the model for choice becomes a binary logistic model with logit $z = z_A - z_B$. Similarly,

$$P_B = \frac{e^{z_B}}{e^{z_A} + e^{z_B}} = \frac{1}{1 + e^{z}}. \tag{2}$$

The advantage of this parametrization is that the logit is unbounded and suitable for linear modeling. The choice model is easily extended to selection among $m$ alternatives by writing

$$P_k = \frac{\pi_k}{\pi_1 + \pi_2 + \cdots + \pi_m}$$

or

$$P_k = \frac{e^{z_k}}{\sum_h e^{z_h}}. \tag{3}$$

In Eq. (3), the indeterminacy of scale becomes one of location, and it may be resolved either by setting $\sum_k z_k = 0$ or estimating $m - 1$ linearly independent contrasts between the $m$ logits. In this model, the "logit" is the vector

$$z' = [z_1, z_2, \ldots, z_m] \tag{4}$$

in $(m - 1)$-dimensional space. In conditions where the respondents' choices may be considered independent trials with constant probabilities, the scalar logit in Eq. (1) may be called a binomial logit, and the vector logit in Eq. (4) a multinomial logit. The binomial logit was introduced into statistics by Fisher and Yates (1953), and its values may be read from their table of the transformation of $r$ to $z$, with $r = 2P - 1$. The multinomial logit was introduced by Bock (1966, 1970, 1975) and independently by Mantel (1966).

Relationship to Thurstone's Choice Models

The logistic choice model approximates closely Thurstone's discriminal-process model based on the normal distribution. According to the latter, a person's perception of the difference between alternatives A and B gives rise to a random "discriminal process"

$$y = \mu_A - \mu_B + \varepsilon,$$

where $\varepsilon$ is a normally distributed error with mean zero. The person chooses A if the process is positive and B otherwise, so that, with the standard deviation of $\varepsilon$ as the unit of scale, the probability of choosing A is

$$P_A = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\mu_A - \mu_B} e^{-t^2/2}\, dt = \Phi(\mu_A - \mu_B); \tag{5}$$

similarly,

$$P_B = \Phi(\mu_B - \mu_A) = 1 - P_A. \tag{6}$$

This is the basis for Thurstone's (1927) model for comparative judgment. Its connection with the Bradley-Terry-Luce model lies in the close approximation of these probabilities by the logistic probabilities in Eqs. (1) and (2), with error nowhere greater than 0.01, when $z = 1.7(\mu_A - \mu_B)$.

The normal discriminal process model may be extended to the choice of one alternative out of $m$ on the assumption that a person chooses the largest of $m$ multivariate normal processes (Bock, 1956; Bock and Jones, 1968). Suppose the alternatives give rise to an $m$-vector process with normally distributed means $\mu_h$, standard deviations $\sigma_h$, and correlations $\rho_{h\ell}$. Then the differences between the process for alternative $k$ and the remaining processes are $(m - 1)$-variate normal with means $\mu_{kh} = \mu_k - \mu_h$, variances $\sigma_{kh}^2 = \sigma_k^2 + \sigma_h^2 - 2\rho_{kh}\sigma_k\sigma_h$, and correlations $\rho_{kh,k\ell}$ = (