In a typical questionnaire testing situation, examinees are not allowed to choose which items they answer, because choice makes it technically difficult to obtain satisfactory statistical estimates of examinee ability and item difficulty. This paper introduces a new item response theory (IRT) model that incorporates information from a novel representation of questionnaire data using network analysis. Three scenarios in which examinees select a subset of items were simulated. In the first scenario, the assumptions required to apply the standard Rasch model are met, thus establishing a reference for parameter accuracy. The second and third scenarios include five increasing levels of violation of those assumptions.
- Item response theory (IRT) represents an important innovation in the field of psychometrics.
- In psychometrics, item response theory (IRT), also known as latent trait theory, strong true score theory, or modern mental test theory, is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables.
To the best of our knowledge, this is the first proposal to obtain satisfactory IRT statistical estimates in the second and third scenarios. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting Information files.
Competing interests: The authors have declared that no competing interests exist.
Item response theory (IRT) comprises a set of statistical models for measuring examinee abilities through their answers to a set of items (a questionnaire). A key property of these models is that the ability estimates do not depend on the particular set of items administered. This property, known as invariance, is obtained by introducing separate parameters for the examinee abilities and item difficulties [1].
The IRT models have optimal properties when all items within a test are mandatory for the examinees to answer. In contrast, if the examinees are allowed to choose a subset of items, instead of answering them all, the model estimates may become seriously biased.
This problem has been raised by many researchers [2–8] but still lacks a satisfactory solution. It matters because several studies have provided evidence that choice has a positive impact on educational development [9–12].
That is, these studies indicated that allowing students to choose which questions to answer increases motivation and engagement in the learning process. In a testing situation, allowing choice seems to reduce examinees' concern about the impact of an unfavourable topic [13]. In addition, choice has been claimed to be a necessary step for improving educational assessment [14, 15]. For example, to achieve invariance, the items used in different tests must be calibrated on the same scale.
This calibration is usually done by creating a question bank from which the selected items are extracted. Items in the bank are calibrated beforehand by being exposed to examinees similar to those for whom the tests are intended; that is, the items are pre-tested. The pre-test process is typically extremely expensive and time consuming. Nevertheless, serious problems were reported during the pre-test, such as the employees of a private high school supposedly making illegal copies of many items and subsequently releasing them.
In addition, the number of items available in the national bank was approximately 6 thousand, whereas the ideal number would be between 15 and 20 thousand.
All these events were harshly criticized by the mainstream media [16–19]. Recent works have proposed optimal designs for item calibration to reduce costs and time [20, 21]. Still, there is a limit on the number of items that an examinee can properly answer within a period of time.
This paper presents a new IRT model to fit data generated in a scenario in which examinee choice is allowed. The proposed model incorporates network analysis information into the IRT model using Bayesian modelling.
To the best of our knowledge, this study is the only proposal to date that achieves satisfactory parameter estimation in the critical scenarios reported in the literature [5]. The scale of the Rasch model is not identified, because adding the same constant to every ability and every item difficulty leaves the response probabilities unchanged. This problem can be solved by choosing an arbitrary scale for either the ability or the item parameters. In classic frequentist statistical analysis, the mean of the abilities is usually set to zero. In Bayesian modelling, the scale constraint can be imposed by means of the prior distribution of the abilities.
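The need for a scale constraint can be illustrated with a minimal sketch, assuming the standard Rasch form P(correct) = 1 / (1 + exp(-(theta - b))). The function name `rasch_prob` and all numeric values are illustrative assumptions, not taken from the paper. Shifting every ability and every difficulty by the same constant leaves all response probabilities unchanged, so the data alone cannot fix the origin of the scale.

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch probability of a correct response for each (examinee, item) pair."""
    theta = np.asarray(theta, dtype=float)[:, None]  # abilities as a column
    b = np.asarray(b, dtype=float)[None, :]          # difficulties as a row
    return 1.0 / (1.0 + np.exp(-(theta - b)))

theta = np.array([-1.0, 0.0, 1.5])    # hypothetical abilities
b = np.array([-0.5, 0.3, 1.0, 2.0])   # hypothetical item difficulties
shift = 3.7                           # arbitrary constant added to both

p_original = rasch_prob(theta, b)
p_shifted = rasch_prob(theta + shift, b + shift)

# True when the two probability matrices agree up to floating-point error
same = np.allclose(p_original, p_shifted)
print(same)
```

Fixing the mean of the abilities, or centring their prior, removes this indeterminacy.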
Further information regarding Bayesian estimation of IRT models can be found in [23]. Given the missing data indicator, the set of responses can be written as y = (y obs, y mis) [28], where y obs denotes the observed values and y mis denotes the missing values. Thus, the likelihood function given in Eq 2 is rewritten [5] as shown in Eq 5, and the joint posterior distribution is given by Eq 6. Bradlow and Thomas [5] showed that if examinees are allowed to choose items, then valid statistical inference for the abilities and item difficulties can be obtained using Eq 2 only if the assumptions in Eqs 7 and 8 hold.
Assumption 7 is known as the missing at random (MAR) assumption and implies that examinees are not able to distinguish which items they would likely answer correctly.
Further details are found in [5]. If both assumptions 7 and 8 hold, then the posterior distribution can be rewritten as shown in Eq 9. In this case, it is assumed that the process that generates missing data is non-informative. Details about statistical inference in the presence of missing data are found in [28]. In Bayesian inference, y mis can be treated as a random quantity (details in [28]); thus, it can be sampled from the posterior distribution [25].
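For orientation, conditions of this kind are commonly written in the missing-data literature as sketched below. The notation is an assumption made here: M stands for the missing data indicator, theta for the abilities, and b for the item difficulties. The exact form of Eqs 7 and 8 should be checked against [5].

```latex
% Assumed notation: M = missing data indicator, \theta = abilities, b = item difficulties.
% MAR-type condition: the choice mechanism does not depend on the unobserved responses.
f(M \mid y_{obs}, y_{mis}, \theta, b) = f(M \mid y_{obs}, \theta, b)
% Second condition: given the observed responses, the choice mechanism
% carries no further information about the model parameters.
f(M \mid y_{obs}, \theta, b) = f(M \mid y_{obs})
```

When both conditions hold, the missingness term factors out of the posterior, which is why the process that generates the missing data can be ignored.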
This sampling is possible because Eq 5 is an augmented data likelihood. Hereafter, it is assumed that if the examinees select items at random, then the unobserved values are MAR; otherwise, the process that generates missing data is informative, and the unobserved values are not MAR.
In the experiment, students indicated which item in each pair of questions they would rather answer. However, they still had to answer all items. For example, a particular pair of items (items 11 and 12) was introduced to the students.
The complete data set had 5, examinees and items. In the simulation study, 50 items were mandatory, and the remaining items were divided into 75 choice pairs. The authors also stated that very little is known about the nature and magnitude of realistic violations of those assumptions.
Wang et al. proposed a model that includes a choice effect parameter. Nevertheless, the authors state that if assumption 7 or assumption 8 is violated, as described in [5], valid statistical inferences are not obtained using either the standard IRT model or the proposed model with the choice effect parameter. The model of Liu and Wang [31] accounts for the random selection of items using a nominal response model [32]. The number of possible choices is equal to the number of different combinations of choosing v of the V items, i.e., V! / (v! (V - v)!).
The proposed model uses the vector of choice combinations rather than the missing data indicator vector. The probability density function PDF is known as the missingness distribution, and vector is assumed to be conditionally independent of and , given and. The authors performed two simulation studies in which the proposed model yielded satisfactory parameter recovery.
The model proposed by Liu and Wang [31] has a limitation: estimation is feasible only if the number of possible choices is relatively small. The authors indicate that the number of examinees should be at least 10 times the number of item parameters in order to achieve reliable estimates in the nominal response model equation.
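The practical weight of this limitation comes from how quickly the number of possible choice sets grows. The sketch below simply counts the subsets of v items out of V for a few designs; the helper name `n_choice_sets` and the smaller designs are illustrative assumptions, and the count is only the number of choice combinations, not the number of parameters of any particular model.

```python
from math import comb

def n_choice_sets(total_items, chosen_items):
    """Number of distinct subsets of `chosen_items` items out of `total_items`."""
    return comb(total_items, chosen_items)

# Hypothetical test designs: (V items available, v items chosen)
designs = [(5, 2), (10, 4), (20, 8), (50, 20)]
counts = {design: n_choice_sets(*design) for design in designs}

for (V, v), n in counts.items():
    print(f"choose {v} of {V}: {n:,} possible choice sets")
```

Even moderate test lengths give a count far larger than any realistic sample of examinees.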
As will be shown, this paper evaluates a scenario in which examinees are allowed to choose 20 items from a total of 50 items. This scenario requires a nominal response model with item parameters.
Consequently, the minimum number of examinees required to estimate the model is 9.
Pena and Costa [33] proposed a novel representation of examinees and their selected items using network analysis. The data set was coded as layers, vertices (or nodes), and edges. That is, the data set is initially represented as M single-layer networks or as a multilayer network [35].
From the multilayer network, two matrices are created. Under the null hypothesis, the statistical distribution of the number of incident edges is the same for each pair of vertices. Further details are found in [33]. Matrix U is therefore a binary matrix that preserves only the statistically significant edges, as shown in Eq 15. In [33], it was shown that the density of matrix U can be used to test whether the MAR assumption holds.
The density of a network is given in Eq 16 [37]. Further details about eigenvector centrality are found in [39]. In addition, the simulation study presented in [33] indicates that the larger the violation of the MAR assumption, the larger the correlation between the eigenvector centrality and the item difficulties.
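The two network summaries used here can be sketched as follows. This is a minimal illustration under stated assumptions: the network is undirected with no self-loops, density is taken as observed edges divided by possible edges, and eigenvector centrality is the leading eigenvector of the adjacency matrix rescaled so that its largest entry is 1. The paper's own Eq 16 may use a different convention, and the small adjacency matrix `A` is hypothetical.

```python
import numpy as np

def network_density(adj):
    """Density of an undirected simple graph: observed edges / possible edges."""
    adj = np.asarray(adj)
    n = adj.shape[0]
    n_edges = np.triu(adj, k=1).sum()          # count each undirected edge once
    return 2.0 * n_edges / (n * (n - 1))

def eigenvector_centrality(adj):
    """Leading eigenvector of a symmetric adjacency matrix, scaled to a maximum of 1."""
    adj = np.asarray(adj, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(adj)     # symmetric input, eigenvalues ascending
    leading = np.abs(eigvecs[:, -1])           # for a connected graph all entries share one sign
    return leading / leading.max()

# Hypothetical 4-item network: item 0 is linked to every other item (a star)
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])

density = network_density(A)
centrality = eigenvector_centrality(A)
print(density, centrality)
```

In this toy network the hub item receives the largest centrality, which is the kind of ordering that would then be compared with the item difficulties.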
It is worth mentioning that the eigenvector centrality assumes values within the range 0 to 1. Therefore, it provides a standardized measure of vertex centrality. In general, item difficulty b i can be written as a function of the first eigenvector of matrix O. This paper proposes a new IRT model that takes this relation into account. The posterior distribution shown in Eq 6 can be rewritten as shown in Eq 20, where Eq 21 assumes that, given the eigenvector centrality, the missing data indicator is independent of the remaining quantities in the model.
Further information about conditioning on covariates so that the missingness mechanism becomes negligible is found in [40, 41]. It is worth mentioning that other functions can be proposed to describe the relation between item difficulty and eigenvector centrality.
To investigate the behaviour of the data under violations of assumptions 7 and 8, we performed several simulations using three different scenarios.
The complete data set can be represented by a 1, versus 50 dimensional matrix of binary responses. In the first scenario, hereafter named scenario 1, both assumptions 7 and 8 were valid.
That is, the items were randomly selected by the examinees, and consequently, the process that causes missing data was non-informative. Therefore, this is the scenario in which valid statistical inference can be obtained using the standard Rasch model. The second scenario, named scenario 2, is identical to the first simulation scenario presented in [ 33 ].
In this group, the examinee has a probability of less than 0. A weight value w i is assigned to the items in each group. In this scenario, assumptions 7 and 8 are violated. Finally, in the third scenario, named scenario 3, the examinee choice depends on y mis. That is, assumption 7 is violated. This scenario is similar to the second simulation study described in [ 5 ]. Table 1 summarizes the proposed simulations. Each configuration was replicated times.
It is worth mentioning that the selected items were generated using a multinomial probability distribution.
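One way to picture this item selection step is sketched below. It is an illustration under assumptions rather than the simulation code of the study: each examinee draws 20 of 50 items without replacement, with selection probabilities proportional to item weights. The sample size, the weight values, and the helper name `select_items` are hypothetical. Equal weights correspond to random choice; unequal weights make some items systematically more likely to be chosen.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_examinees, n_items, n_chosen = 200, 50, 20    # hypothetical sizes

def select_items(weights, rng):
    """Return a boolean (examinees x items) mask marking the chosen items."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                              # normalise weights to probabilities
    mask = np.zeros((n_examinees, n_items), dtype=bool)
    for e in range(n_examinees):
        chosen = rng.choice(n_items, size=n_chosen, replace=False, p=p)
        mask[e, chosen] = True
    return mask

# Random choice: every item has the same weight
uniform_mask = select_items(np.ones(n_items), rng)

# Informative choice: the first 25 items get three times the weight (hypothetical)
w = np.where(np.arange(n_items) < 25, 3.0, 1.0)
weighted_mask = select_items(w, rng)

uniform_rates = uniform_mask.mean(axis=0)        # selection rate per item
weighted_rates = weighted_mask.mean(axis=0)
print(weighted_rates[:25].mean(), weighted_rates[25:].mean())
```

The complement of each mask plays the role of the missing data indicator: unselected items are the entries of y mis.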
This study aims to develop a new class of higher-order mixture IRT models by integrating mixture IRT models and higher-order IRT models to address these practical concerns.
The proposed higher-order mixture IRT models can accommodate both linear and nonlinear models for latent traits and incorporate diverse item response functions. The Rasch model was selected as the item response function, metric invariance was assumed in the first simulation study, and multiparameter IRT models without an assumption of metric invariance were used in the second simulation study.
A larger sample size resulted in a better estimate of the model parameters, and a longer test length yielded better individual ability recovery and latent class membership recovery. The linear approach outperformed the nonlinear approach in the estimation of first-order latent traits, whereas the opposite was true for the estimation of the second-order latent trait.
Additionally, imposing identical factor loadings between the second- and first-order latent traits by fitting the mixture bifactor model resulted in biased estimates of the first-order latent traits and item parameters. Finally, two empirical analyses are provided as examples to illustrate the applications and implications of the new models.
Multiple latent traits measured by a battery of tests are often correlated and can be assumed to contain a higher-order structure based on substantive knowledge.
In certain cases, large-scale measurements, such as the Programme for International Student Assessment (PISA), can be treated as a measurement of three orders of latent traits. In this case, the domains, subjects, and general concepts can be treated as the first-, second-, and third-order latent traits, respectively. Higher-order IRT models can estimate lower- and higher-order latent traits simultaneously and can enhance testing efficiency by using the higher-order latent trait as an indicator of overall assessment for examinees and the lower-order latent traits as indicators of formative assessment (for an overview, see Huang et al.).
In most studies, applications of mixture IRT models focus on measuring a single latent trait and assume that examinee responses from different subgroups follow a specific Rasch model (Rasch) with different latent trait distributions and different item parameter sets. However, De Boeck, Cho, and Wilson developed a mixture IRT model under a multidimensional structure to explain the causes of differential item functioning between latent classes, and De Jong and Steenkamp extended unidimensional mixture IRT models to a multidimensional mixture IRT model for a study on cross-cultural comparisons.
However, potential higher-order relationships among latent traits are rarely discussed in the literature in relation to factor mixture models or mixture IRT models. Because mixtures of latent classes occur in multiple latent traits, the assumption that multiple orders of latent traits may have different mixtures of distributions among latent classes is justified.
Thus, a new class of higher-order mixture IRT models is developed in this study. Recently, Cho, Cohen, and Kim developed the mixture bifactor model by accommodating mixtures of latent classes in the bifactor model, and to the best of our knowledge, their study was the first attempt to measure the general and specific domains while simultaneously investigating the effects of different response patterns on both types of dimensionality.
Although the mixture bifactor model shares several similarities with our higher-order mixture IRT models in terms of model formulation, the conceptualization differs completely between the two types of models, and the differences should be noted. First, from a measurement perspective, the latent traits specified for each test or testlet are considered nuisance dimensions in the mixture bifactor model but intended measurement targets in higher-order mixture IRT models.
Second, from a modeling perspective, bifactor models are mathematically equivalent to higher-order IRT models, so the development of higher-order mixture IRT models may appear redundant. This equivalence holds when a common latent trait and several specific latent traits are measured; however, if additional common latent traits with additional hierarchical structures are involved, then the current mixture bifactor model is not applicable.
Third, bifactor models assume a linear relationship among latent traits, whereas higher-order IRT models allow for a nonlinear function relating second- and first-order latent traits. We will demonstrate the development and application of nonlinear higher-order mixture IRT models in the following sections.
Fourth, in a general bifactor model for a testlet, the factor loadings or discrimination parameters on the specific latent traits can be estimated separately from the factor loadings that are estimated for the common latent trait.
However, the mixture bifactor model of Cho et al. constrains the general and specific factors to share a common set of factor loadings. We will illustrate the difference in model derivations between the two types of models in detail in the next section. Based on these previous studies, creating a new class of higher-order mixture IRT models that serves as an extension of and supplement to the current mixture bifactor model will be of considerable value.
The following sections first introduce mixture IRT models and higher-order IRT models and subsequently elaborate on the development of the new class of higher-order mixture IRT models. Two empirical examples are presented to demonstrate the applications and implications of the new models.
The final section provides our conclusions for the new models as well as suggestions for future research. When the item response function follows the three-parameter logistic model 3PLM; Birnbaum, , a mixture 3PLM can be specified, and the probability of a correct response to item i for person n of latent class g can be formulated as. Because the item parameters contain the subscript g for each item, a different set of item parameters can be estimated for different latent classes.
Although mixture IRT models are flexible and can accommodate a great diversity of item response functions, the mixture Rasch model has been widely applied to latent class detection.
Therefore, we expended considerable efforts to assess the higher-order mixture Rasch model in terms of its estimation efficiency in the following simulation study. In addition to the Rasch model, the higher-order mixture 2PLM and 3PLM were evaluated via simulations and compared with the corresponding mixture bifactor models.
Variations of Equation 3 can be obtained if higher-order latent traits have polynomial and interaction effects on lower-order latent traits. The polynomial or nonlinear formulation is illustrated in a later section. For simplicity, a two-order structure with one common second-order latent trait and several first-order latent traits is assumed, and the higher-order 3PLM then specifies the probability of a correct response to item i in test v for person n.
Other model extensions for higher-order latent traits within the framework of IRT models are described in the work of Huang et al.
If manifest groups are either not available or not reliable, a mixture IRT model can be implemented (De Boeck et al.). Let g 1 and g 2 be the indices of the latent classes that arise in the first- and second-order latent traits, respectively. The mixtures of latent classes can then be accommodated in the higher-order 3PLM by formulating the probability of correctly answering item i of test v for person n within classes g 1 and g 2. Higher-order mixture IRT models are very flexible and general, in that different orders can have different numbers of latent classes.
For example, one may assume two latent classes in the second order and four latent classes in the first order based on substantive theory or empirical findings.
For simplicity and ease of interpretation, the same number of latent classes was assumed across orders in this study; therefore, the subscripts g 1 and g 2 were reduced to g. In the following simulation studies, when the Rasch model is used as the item response function, we assumed that all latent classes had an identical set of factor loadings.
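The data-generating structure of a linear higher-order mixture Rasch model can be sketched as follows. This is a minimal illustration under assumptions, not the simulation design of the study: two latent classes, one second-order trait whose mean depends on the class, five first-order traits obtained as loading times second-order trait plus a residual, loadings shared by both classes, and class-specific item difficulties. Every numeric value and variable name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n_persons, n_tests, n_items = 500, 5, 20       # hypothetical sizes
class_probs = np.array([0.6, 0.4])             # hypothetical mixing proportions
class_means = np.array([0.0, 1.0])             # hypothetical second-order means per class
loadings = np.full(n_tests, 0.8)               # hypothetical loadings, shared by classes
resid_sd = np.sqrt(1.0 - loadings**2)          # within-class first-order variance of 1

# Latent class membership g for each person
g = rng.choice(len(class_probs), size=n_persons, p=class_probs)

# Second-order trait, then first-order traits = loading * second-order + residual
theta2 = rng.normal(class_means[g], 1.0)
theta1 = loadings * theta2[:, None] + rng.normal(0.0, resid_sd, size=(n_persons, n_tests))

# Class-specific item difficulties, shape (class, test, item)
difficulty = rng.normal(0.0, 1.0, size=(len(class_probs), n_tests, n_items))

# Rasch probabilities and simulated 0/1 responses, shape (person, test, item)
logits = theta1[:, :, None] - difficulty[g]
prob = 1.0 / (1.0 + np.exp(-logits))
responses = (rng.random(prob.shape) < prob).astype(int)
print(responses.shape, responses.mean())
```

Because the difficulties differ by class while the loadings do not, this sketch corresponds to the case in which metric invariance holds but scalar invariance does not.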
It is worth noting the major differences between the model proposed here and the mixture bifactor model proposed by Cho et al. In that model, the general and specific factors share a common set of factor loadings or discrimination parameters for each latent class.
Such constraints in the mixture bifactor model limit its applicability because the factor loadings are important indicators of the effect that the general and specific factors have on each test item in real testing situations.
In addition, the subscripts g 1 and g 2 , which indicate the class membership in the first and second orders, respectively, reduce to a single indicator of g because the number of classes is constrained to the same number for the general and specific factors in the mixture bifactor model. Issues of measurement invariance deserve further attention in higher-order mixture IRT models.
If one of these terms lacks a g 1 subscript e. The imposition of equivalence constraints on the discrimination parameters across items and the factor loadings across classes implies that the assumption of metric invariance is satisfied. However, the assumption of scalar invariance is not fulfilled because the item difficulties are allowed to differ for each latent class De Boeck et al. When appropriate, the constraints of equal factor loadings and first-order residual variances or identical discrimination parameters across latent classes and items can be relaxed.
Two conditions that can be used to determine whether the metric is invariant will be demonstrated in the following simulation studies. For identification purposes, the mean second-order latent trait for one class e.
A further constraint should be considered when metric invariance is violated. Note that the constraints on the first-order residual or the specific factor variances are not necessary in the mixture bifactor model (Cho et al.). Setting these constraints on model parameters to identify the model is common practice in the literature on mixture IRT models.
The models described above assume a linear relationship between higher- and lower-order latent traits. In addition, because the mixtures of distributions in the second-order latent trait are of particular interest in this study and the second-order latent trait is often formulated to include polynomials in a nonlinear factor model, an investigation of the effects of mixtures of latent classes on model parameter estimation in a nonlinear or quadratic higher-order IRT model is justifiable.
Therefore, a nonlinear regression curve may be a plausible alternative and can be applied to higher-order mixture IRT models in the form of a polynomial of degree r in the second-order latent trait, which defines the relationship between the second- and first-order latent traits. The quadratic factor model serves as an example. To ensure that the polynomials in the second-order latent trait are mutually orthogonal, a different parameterization is used in place of Equation 9.
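Why an orthogonal parameterization helps can be seen in a small numerical sketch. The construction below is an assumption made for illustration and is not claimed to be the parameterization used in the study: the linear term is the centred trait z, and the quadratic term is z squared minus its variance. With a trait whose mean is not zero, the raw linear and quadratic terms are strongly correlated, whereas the centred pair is approximately uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(seed=11)

# Hypothetical second-order trait with a non-zero mean
theta = rng.normal(0.5, 1.0, size=200_000)

# Raw polynomial terms: theta and theta squared
raw_corr = np.corrcoef(theta, theta**2)[0, 1]

# Orthogonalised terms (assumed construction): centre first, then remove the variance
z = theta - theta.mean()
quadratic = z**2 - z.var()
orth_corr = np.corrcoef(z, quadratic)[0, 1]

print(round(raw_corr, 3), round(orth_corr, 3))
```

Keeping the terms uncorrelated means the linear and quadratic regression weights can be interpreted separately.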
Similar to the linear higher-order mixture IRT model, the model identification settings, latent trait and residual distributions, and measurement invariance features apply directly to the nonlinear higher-order mixture IRT model. Following the constraints in the linear higher-order mixture Rasch model, we assumed that all latent classes shared a set of regression weights; thus, the subscript g 1 was omitted from Equation 9 onward. Assume that both orders have the same latent classes (g 1 and g 2 simplify to g); the parameter space of the higher-order mixture Rasch model and the joint posterior distribution of its parameters can then be written down.
Bayesian estimation with Markov chain Monte Carlo (MCMC) methods was implemented to sample from the full conditional distributions of the parameters and thereby represent the joint posterior distribution, and the mean of the marginal posterior density was treated as the parameter estimate of interest.
Before applying the Bayesian estimation, a prior distribution for each parameter is required. The same priors were set in the following simulations and empirical analyses. For the item parameters, a normal prior distribution with a mean of 0 and a variance of 4 was used for the difficulty parameters, a lognormal distribution with a mean of 0 and a variance of 1 was used for the discrimination parameters, and a beta prior with both hyperparameters equal to 1 was set for the pseudo-guessing parameters.
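To get a feel for what these item-parameter priors imply, one can simply draw from them. The sketch below uses NumPy for illustration only and is not the estimation software of the study. One assumption is flagged: the lognormal prior's mean of 0 and variance of 1 are read as parameters of the underlying normal on the log scale.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n_draws = 100_000

# Difficulty: normal prior with mean 0 and variance 4 (standard deviation 2)
difficulty = rng.normal(loc=0.0, scale=np.sqrt(4.0), size=n_draws)

# Discrimination: lognormal prior; mean 0 and variance 1 are interpreted here
# as parameters on the log scale (an assumption)
discrimination = rng.lognormal(mean=0.0, sigma=1.0, size=n_draws)

# Pseudo-guessing: beta prior with both hyperparameters equal to 1 (uniform on [0, 1])
guessing = rng.beta(1.0, 1.0, size=n_draws)

print(difficulty.var(), np.median(discrimination), guessing.mean())
```

The draws show that the difficulty prior is fairly diffuse, the discrimination prior is restricted to positive values, and the guessing prior is flat over its whole range.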
For the person parameters, a normal prior distribution with a mean of 0 and a variance of 10 was used for the second-order latent trait mean and a normal prior distribution with a mean of 0. A gamma prior distribution with both hyperparameters equal to 0. A categorical prior distribution was set for the indicators of latent classes with a conjugate Dirichlet distribution in which the hyperparameters were set to one.
Li et al. To assess the efficiency of the proposed models and examine the parameter recovery, the following two simulations were conducted: one for the higher-order mixture Rasch model, which included linear and nonlinear approaches, and the other for the higher-order mixture 2PLM and 3PLM, which were compared with the mixture bifactor model. Each study contained five first-order latent traits i. For the linear higher-order mixture Rasch model, the two manipulated factors were sample size 1, or 2, persons and test length 20 or 30 items in each test.
The factor loadings were set to 0. For the nonlinear higher-order mixture Rasch model, a quadratic higher-order mixture Rasch model was used to generate simulated data. The sample size 2, or 3, persons and test length 20 or 30 items in each test were varied.
A larger sample size was used in the second simulation study because the nonlinear mixture model requires additional subjects to obtain a stable estimation. The ability distributions for the second-order latent trait for both classes were identical to those of the linear approach except that the mean second-order latent trait across classes was constrained to zero.
The same example was used to obtain residual variances with values of 0. In the second simulation study, the higher-order mixture 2PLM and 3PLM were used to generate responses by 2, examinees to 20 or 30 items in each test.
The settings related to the distributions of the second- and first-order latent traits were set to the same values used in the linear higher-order mixture Rasch model.
The majority class had the same factor loadings as the first simulation study of the linear approach, and the minority class had factor loading values of 0. When the item responses were generated, we used the generating model and its corresponding mixture bifactor model to fit the data and assess the consequences of implementing a misleading constraint of identical factor loadings in the mixture bifactor model.
For both simulations, the item parameters were generated with the distribution described below. A value of 0. The item discrimination parameters were generated from a uniform distribution between 0. Each test had a common pseudo-guessing parameter that was generated from a uniform distribution between 0.