Measurement decision theory for classification with polytomous items
Number Of Pages: 15, 1.5-spaced (5055 words)
Number Of Sources:
Type of Document:
I have a draft with comments from reviewers. I need to rewrite it according to the comments (or write another paper on a closely related topic).
Comments to the Author
The manuscript looks at CAT for mastery testing (dichotomous classification) with polytomous items under Rudner's (2009) framework. Essentially, the authors conducted simulation studies and reported results on one aspect of Rudner's work, namely the Shannon entropy selection criterion, with polytomous responses. I have the following comments:

1. The simulation studies are thin and routine.
2. Having a polytomous outcome variable makes little real difference relative to the dichotomous case, as the statistical procedures (likelihood and posterior, in their general forms) are virtually the same.
3. There is a substantial literature on Shannon entropy-based CAT/classification; see, for example, Xu, Chang, and Douglas (2003, NCME), Weissman (2007, EPM), and Wang (2013, EPM). In particular, Xu et al. (2003) and Wang (2013) cover the more general diagnostic classification models, and Weissman covers multi-category classification. These papers are certainly relevant and should in fact perform more or less similarly, as the Bayesian decision approach typically differs little from other approaches. These papers are unfortunately not referenced in this manuscript.

A minor point: on page 2, the acronym CCT is first used but not spelled out.
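For concreteness, the Shannon entropy selection criterion at issue amounts to administering the item that minimizes the expected entropy of the class posterior. A minimal sketch follows; the two-class setup, pool contents, and function names are hypothetical illustrations, not the manuscript's implementation:

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def expected_posterior_entropy(prior, item_probs):
    """Expected Shannon entropy of the class posterior after one item.

    prior      : prior[k] = P(class k)
    item_probs : item_probs[k][x] = P(response category x | class k)
    """
    n_cats = len(item_probs[0])
    exp_h = 0.0
    for x in range(n_cats):
        # Marginal probability of observing response category x
        p_x = sum(prior[k] * item_probs[k][x] for k in range(len(prior)))
        if p_x == 0:
            continue
        # Posterior over classes given response x (Bayes rule)
        post = [prior[k] * item_probs[k][x] / p_x for k in range(len(prior))]
        exp_h += p_x * entropy(post)
    return exp_h

def select_item(prior, pool):
    """Pick the item whose administration minimizes expected posterior entropy."""
    return min(pool, key=lambda item: expected_posterior_entropy(prior, item))

# Hypothetical 3-category items, two classes (non-master, master)
pool = [
    [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]],  # strongly discriminating
    [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]],  # weakly discriminating
]
best = select_item([0.5, 0.5], pool)
```

With a flat prior, the strongly discriminating item is selected because it is expected to reduce posterior uncertainty the most; the same general form covers dichotomous items as the two-category special case, which is the reviewer's point about the procedures being virtually identical.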
Comments to the Author
This paper discusses measurement decision theory (MDT) as an underlying model for computerized adaptive testing (CAT) in the context of classification, and reports results from a simulation study that examines whether MDT can be successfully applied to the case of polytomously scored items. Although the basic idea of applying MDT CAT to polytomous items is novel, I have some reservations, as discussed below.

The usefulness and applicability of MDT CAT with dichotomous items have been discussed in the literature, and it is entirely conceivable that the general characteristics and benefits of MDT CAT generalize to polytomous items. A simple demonstration of applying MDT CAT to polytomous items would not, in my opinion, be sufficiently worthy of publication in JEM.

Moreover, the paper seems to fail to provide enough evidence that MDT CAT is an effective model that can be used in broader contexts. Showing classification accuracy levels of .83-.90 provides only limited information about the general performance of the model, because each testing program has its own desired level of classification accuracy. Also, the simulation conditions seem to be missing several important study factors, which obviously limits the generalizability of the results reported in the paper.

The effectiveness of the MDT model can only be assessed objectively when it is compared to other competing models. I would strongly suggest that the simulation study include both traditional IRT classification testing and MDT CAT and compare their relative classification efficiency and accuracy. Some of the statements and conclusions in favor of MDT CAT in the paper do not seem to be directly supported by the results of the study.

One way to increase the contribution of the current paper would be to include a few other study factors in the simulation.
For example:

• Although it is not explicitly mentioned, the number of score categories for the polytomous items used in the simulation appears to be 3. It would be useful to include at least one more condition with a larger number of categories.
• The actual size of the item pools used in the simulation was not reported in the paper. It may be that MDT CAT is more (or less) sensitive to the size of an item pool than the traditional IRT method. It would be very useful to examine the relative performance of MDT CAT and traditional CAT under varying item-pool-size conditions.
• Calibration sample size could also be considered.
• The cut score was set at the 70th percentile. One or more different cut-score locations could be considered.
• It would also be interesting to look at a case with more than one cut point (i.e., multiple-category classification).

Several levels of the maximum test length were specified in the simulation study. How many times, for each condition, was the maximum test length reached while Wald's SPRT stopping rule was in place? Did the cases that reached the maximum length largely account for the proportion of inaccurate classifications? How can the classification-accuracy results be interpreted in terms of the relationship between the maximum test length and the SPRT stopping rule? For example, does the extent to which Wald's SPRT stopping rule is efficient have anything to do with the level of accurate classifications?

Can MDT CAT be used for "ordered" polytomous score categories only? Whether the answer is yes or no, this issue should be discussed.

The first six pages of the manuscript were devoted to explaining the MDT CAT model, which is very similar to the description provided in Rudner (2009). I would suggest that the authors pay attention to both the similarity and the length of that part in the next revision.

Last, but not least, the writing needs to be improved substantially. There are many loose ends and grammatical errors in the text.
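The interaction between Wald's SPRT stopping rule and a maximum test length that these questions probe can be sketched as follows; the error rates, log-likelihood-ratio increments, and forced-decision rule below are hypothetical illustrations, not the manuscript's settings:

```python
import math

def sprt_decision(log_lr, alpha=0.05, beta=0.05):
    """Wald's SPRT decision for a two-hypothesis (master/non-master) test.

    log_lr : cumulative log-likelihood ratio, log[L(master)/L(non-master)]
    Returns 'master', 'non-master', or 'continue'.
    """
    upper = math.log((1 - beta) / alpha)   # classify as master
    lower = math.log(beta / (1 - alpha))   # classify as non-master
    if log_lr >= upper:
        return "master"
    if log_lr <= lower:
        return "non-master"
    return "continue"

def run_sprt(log_lr_increments, max_items, alpha=0.05, beta=0.05):
    """Administer items until the SPRT decides or max test length is hit.

    Returns (decision, items_used). A forced decision at the maximum
    length is made here by the sign of the final log-likelihood ratio.
    """
    log_lr = 0.0
    for n, inc in enumerate(log_lr_increments[:max_items], start=1):
        log_lr += inc
        decision = sprt_decision(log_lr, alpha, beta)
        if decision != "continue":
            return decision, n
    return ("master" if log_lr >= 0 else "non-master"), min(len(log_lr_increments), max_items)

# A strongly master-like response pattern decides early...
early = run_sprt([0.8] * 20, max_items=30)
# ...while an ambiguous one is forced to a decision at the maximum length
forced = run_sprt([0.1] * 5, max_items=5)
```

Examinees in the `forced` situation never cross either SPRT boundary, which is exactly the group the questions above ask about: how often it occurs per condition, and whether it accounts for most misclassifications.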
Comments to the Author
This manuscript presents some interesting research using measurement decision theory (MDT), a high-quality model that is applicable in many situations where item response theory models are not. In particular, this paper shows the applicability of MDT to items scored with partial credit. The topic of this paper is very interesting, very applicable, and well within the scope of JEM. There are, however, serious problems with this paper that make it unsuitable, as is, for JEM.

Foremost, the paper too closely follows the logic, construction, and language of Rudner (2009). The submitted paper reads like the Rudner paper, except for a few paragraphs and the numbers. Most of the lists of references in this paper are taken verbatim from the Rudner paper. Worse yet, there are multiple entire paragraphs that are either exact copies or very close copies. This is not acceptable in scholarly literature. The entire paper needs to be the author's own. Further, it makes one wonder whether the author has even read the referenced papers. To be acceptable, a complete rewrite is needed.

That said, I was impressed by the research. The author had a good design and ran some helpful and informative simulations. I have a few questions and comments.

More information is needed about the simulated bank. The a parameter was generated from N(.5, 2.0). That SD appears excessive and should be justified. Was there a minimum accepted a parameter? A minimum of .3 or .4 would make sense. Were there many items in the banks with a parameters approaching 0? Were there any with negative values?

Individual item responses were based on the IRT PCM rather than MDT probabilities. That is fine, but it should be discussed and justified.

The results showed average test lengths and percent correct classifications at different maximum test lengths. It would be helpful to also show the number of test takers classified as a function of test length.

Three different bank difficulties were tried; why?
Usually one tries to match the bank to the examinees.

With pass/fail classifications, I have to wonder whether item selection was really adaptive. That is, is there any real difference in the order in which examinees of different abilities get items? I suspect that the items were administered in entropy order, or something close to it, regardless of the examinee. Whether one has a high or a low probability of being a master, the most informative item will be the one that does the best job of separating masters from non-masters. I suspect more than two groups are needed for MDT CAT to make sense.
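The bank-generation concern above can be made concrete. Below is a hedged sketch of one way to draw discrimination parameters with the floor suggested (redrawing below a minimum) and to simulate polytomous responses under a generalized-partial-credit-style model; all distributions, floors, and parameter values are illustrative assumptions, not what the manuscript actually did:

```python
import math
import random

def draw_a(mean=0.5, sd=2.0, a_min=0.3, rng=random):
    """Draw a discrimination parameter from N(mean, sd), redrawing until
    it clears the floor (a_min); this rules out near-zero and negative a."""
    while True:
        a = rng.gauss(mean, sd)
        if a >= a_min:
            return a

def gpcm_probs(theta, a, bs):
    """Category probabilities (0..m) under a generalized-partial-credit-style
    model; bs holds the step difficulties b_1..b_m."""
    cum = [0.0]
    for b in bs:
        cum.append(cum[-1] + a * (theta - b))
    m = max(cum)
    ex = [math.exp(c - m) for c in cum]  # subtract max for numerical stability
    s = sum(ex)
    return [e / s for e in ex]

def simulate_response(theta, a, bs, rng=random):
    """Sample one polytomous response for an examinee at ability theta."""
    probs = gpcm_probs(theta, a, bs)
    u, acc = rng.random(), 0.0
    for x, p in enumerate(probs):
        acc += p
        if u <= acc:
            return x
    return len(probs) - 1

# Hypothetical 3-category item: a floored draw plus two step difficulties
random.seed(1)
a = draw_a()
p_high = gpcm_probs(3.0, 1.2, [-0.5, 0.5])  # high-ability examinee
```

Reporting the rejection rate of `draw_a` under N(.5, 2.0) would directly answer the questions about near-zero and negative discriminations in the generated banks.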