A recent study examining computerized adaptive testing with the MMPI-2 in a sample of 140 Veterans Affairs hospital patients revealed substantial savings in the number of items administered and in administration time, with minimal impact on test validity. However, another study that examined the equivalence of MMPI-2 results for 571 undergraduates across booklet, conventional computerized, and adaptive computerized versions found several significant differences in mean scale scores across testing formats for both men and women.
Independent of the concerns raised by these findings, the design adopted in the latter study provides a model by which the equivalence of testing formats for other instruments should also be examined. Outside the domain of paper-and-pencil personality measures, evidence for the equivalence of computerized adaptations of psychological tests is weak or nonexistent, as others have noted. Of particular interest is the low congruence between computerized and interviewer-administered versions of structured clinical interviews. To what should we attribute these findings?
Paradoxically, noncongruence may result from respondents' greater willingness to divulge personally sensitive information during computer-based administrations. Thus, whereas individual item responses on computer-based administrations may have greater validity, conclusions drawn from these responses (or their aggregates) may have lower validity (and result in overdiagnosis) if they are based on criterion cutoffs derived from traditional noncomputerized procedures.
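To make this concern concrete, the following sketch uses invented numbers (a hypothetical nine-symptom checklist and a cutoff of five endorsements) to show how a fixed cutoff calibrated on paper-and-pencil administration can inflate diagnostic rates when respondents disclose somewhat more under computer administration; none of the values come from the studies discussed here.

```python
# Hypothetical illustration: a symptom-count cutoff calibrated on paper-and-pencil
# administration is applied unchanged to computer-administered responses, where
# respondents disclose somewhat more. All numbers are invented.
CUTOFF = 5  # "diagnose" if at least 5 of 9 symptoms are endorsed (hypothetical)

paper_counts = [3, 4, 4, 5, 2, 6, 4, 3]       # symptoms endorsed on paper
extra_disclosed = [1, 1, 0, 0, 2, 0, 1, 1]    # additional endorsements on computer
computer_counts = [p + e for p, e in zip(paper_counts, extra_disclosed)]

paper_positive = sum(count >= CUTOFF for count in paper_counts)
computer_positive = sum(count >= CUTOFF for count in computer_counts)
print(f"Diagnoses on paper: {paper_positive}, on computer: {computer_positive}")
# Even if the extra computer endorsements are individually more valid, the
# unchanged cutoff labels more respondents as cases (2 vs. 4 in this example).
```

In this toy example, additional respondents cross the diagnostic threshold only because the cutoff was never recalibrated for the more disclosing format.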
A further consideration regarding equivalence studies involves the appropriate reference point for evaluating results. Specifically, equivalence across testing formats should not be expected to exceed reliability within formats. Kappa statistics examining congruence in diagnoses derived from computer-based versus interviewer-based administrations of the Diagnostic Interview Schedule (DIS) were modest (mean κs across diagnoses = .49, .51, and .63 across three studies). However, these were comparable to kappas reflecting congruence within interviewer-based DIS administrations (mean κs = .38 and .63) and within computer-based administrations (mean κ = .59). That is, equivalence coefficients across methods, although modest, were no lower than those obtained from repeated administrations within a format (Dzida 1998).
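For readers unfamiliar with the statistic, the following sketch computes Cohen's kappa for two administration formats on a single binary diagnosis; the diagnoses are fabricated for illustration and are unrelated to the DIS studies summarized above.

```python
# Illustrative computation of Cohen's kappa for agreement between two
# administration formats on a single binary diagnosis; the diagnoses below are
# fabricated and unrelated to the DIS studies cited in the text.

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two parallel lists of categorical ratings."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))

    # Observed agreement: proportion of cases on which the two formats agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the marginal rates, summed over categories.
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# 1 = disorder judged present, 0 = absent, for ten hypothetical respondents.
computer_based = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
interviewer_based = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
print(f"kappa = {cohens_kappa(computer_based, interviewer_based):.2f}")  # 0.58
```

A kappa of about .58 in this toy example would fall in the same range as the cross-format and within-format coefficients reported above, underscoring that cross-format agreement should be judged against within-format reliability rather than against perfection.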
Apart from psychometric equivalence, some authors have advocated examining similarities or discrepancies in the perceptual, emotional, and attitudinal reactions of respondents to computerized versus traditional procedures. Little is known about the test stimulus properties, characteristics of the respondent, or behaviors of the clinician that moderate such experiential equivalence.
For example, greater anxiety or discomfort in one format relative to another may adversely affect participation in the assessment process. Similarly, not much is known about respondent characteristics that moderate psychometric equivalence. Honaker and Fowler noted that much of the extant equivalence research is based on nonclinical populations and may not generalize to clinical situations (Sage 2001). Separate from questions of equivalence across administration formats are issues of internal consistency and temporal stability for computer-administered measures.
As noted earlier, computerized adaptive testing may reduce the standard error of measurement by optimizing the subset of items administered to each individual. However, computerized administration may either increase or decrease reliability for any given individual, depending on the extent to which it influences attention to content, ease of correcting unintended responses, completion of items initially omitted, and similar factors (Dean 2003).
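As an illustration of the adaptive-testing logic (and only as an illustration: this is a generic one-parameter IRT sketch, not the algorithm of any particular MMPI-2 system), the following code selects, at each step, the unasked item that is most informative at the current trait estimate and stops once a target standard error is reached.

```python
import math
import random

# Generic sketch of adaptive item selection under a one-parameter (Rasch) IRT
# model; the item difficulties, stopping rule, and grid estimator are all
# simplifications chosen for illustration.

def p_keyed(theta, b):
    """Probability of endorsing an item with difficulty b at trait level theta."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at theta: p(1 - p)."""
    p = p_keyed(theta, b)
    return p * (1.0 - p)

def estimate_theta(responses):
    """Crude maximum-likelihood estimate of theta over a fixed grid."""
    grid = [x / 10.0 for x in range(-40, 41)]  # -4.0 to 4.0 in steps of 0.1
    def log_likelihood(theta):
        return sum(
            math.log(p_keyed(theta, b)) if endorsed else math.log(1.0 - p_keyed(theta, b))
            for b, endorsed in responses
        )
    return max(grid, key=log_likelihood)

def adaptive_test(item_difficulties, answer_fn, se_target=0.40, max_items=40):
    """Administer items until the standard error of theta falls below se_target."""
    responses, theta = [], 0.0
    remaining = list(item_difficulties)
    while remaining and len(responses) < max_items:
        # Administer the most informative remaining item at the current estimate.
        b = max(remaining, key=lambda d: item_information(theta, d))
        remaining.remove(b)
        responses.append((b, answer_fn(b)))
        theta = estimate_theta(responses)
        total_info = sum(item_information(theta, d) for d, _ in responses)
        if total_info > 0 and 1.0 / math.sqrt(total_info) < se_target:
            break  # precision criterion met before the full bank is exhausted
    return theta, len(responses)

def simulated_answer(b, true_theta=1.0):
    """Simulate a respondent with a known trait level answering item b."""
    return random.random() < p_keyed(true_theta, b)

random.seed(0)
bank = [x / 10.0 for x in range(-20, 21)]  # 41 item difficulties from -2.0 to 2.0
print(adaptive_test(bank, simulated_answer))  # (theta estimate, items administered)
```

With a reasonably well-targeted item bank, the precision criterion is typically reached before the full bank is exhausted, which is the source of the item and time savings described above.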
An additional caution regarding the reliability of computer-based test interpretations (CBTIs) concerns the reliability of the narratives themselves. Several authors have noted that CBTIs have (theoretically) perfect reliability in producing a given set of narrative statements from a given set of responses. However, because most CBTI systems translate continuous variables (test scores) into discrete variables (e.g., 3 to 4 unique score ranges) when linking scale scores to interpretive statements, small variations in test scores can produce very different clinical descriptions. Consequently, any study of equivalence (whether within or between testing formats) should examine not only the congruence of test scores but also the congruence of the CBTI narrative output.
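The point about discretization can be illustrated with a deliberately simplified lookup; the cutoffs and interpretive statements below are invented and do not reproduce any actual CBTI system.

```python
# Hypothetical CBTI-style lookup: a continuous T score is collapsed into a few
# discrete ranges, each tied to a fixed interpretive statement. The cutoffs and
# wording are invented for illustration and do not come from any real system.
INTERPRETIVE_RANGES = [
    (0, 64, "Scores in this range are within normal limits."),
    (65, 74, "Moderate elevation; clinically significant distress is likely."),
    (75, 120, "Marked elevation; severe symptomatology should be considered."),
]

def narrative_for(t_score):
    """Return the canned statement whose score range contains t_score."""
    for low, high, statement in INTERPRETIVE_RANGES:
        if low <= t_score <= high:
            return statement
    raise ValueError(f"T score {t_score} is outside the supported range")

# A two-point difference in the obtained score crosses a cutoff and therefore
# produces a categorically different clinical description.
print(narrative_for(64))  # within normal limits
print(narrative_for(66))  # clinically significant distress is likely
```

Because a retest score that differs by only a point or two can fall into an adjacent range, narrative congruence can be lower than score congruence, which is why both should be examined.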
Finally, the difficulty of interpreting any noncongruence between computerized and noncomputerized versions of a measure warrants further discussion. For the various reasons already noted, computerized administration of a measure may elicit either less valid or more valid responses than its noncomputerized counterpart, and this effect may vary across individuals. Thus, studies of equivalence are insufficient for supporting or refuting the accuracy of psychological measures in either format, and linkage of test findings to relevant nontest criteria remains essential (Hooghiemstra 1999).