Item calibration samples and the stability of achievement estimates and system rankings

Periodical: International Journal of Testing
Volume: 16
Year: 2016
Issue number: 1
Page range: 1-20
Relates to study/studies: PISA 2012

Another look at the PISA model

Abstract

Using an empirically based simulation study, we show that commonly used methods of choosing an item calibration sample have significant impacts on achievement bias and on system rankings. We examine whether recent PISA accommodations, especially those for lower-performing participants, can mitigate some of this bias. Our findings indicate that standard operational methods, while not ideal, recover underlying proficiency reasonably well and generally outperform methods that more completely include all participants. Translated onto the PISA scale, the choice of calibration sample can induce bias of up to 12.49 points, which is substantial given that standard errors are around three points. Although ranking correlations are at least .95, we note the policy implications of even slight ranking changes. Our findings also indicate that limited accommodations targeted at low-achieving educational systems do not outperform either of the other methods considered. Research that further explores accommodations for heterogeneous populations is recommended.