Written by Alan Tucker, State University of New York—Stony Brook

(*News Bulletin,* May/June 2004)

I want to alert readers to some troubling flaws that can affect the standards-based mathematics tests mandated by recent state and federal legislation. These problems became apparent to me during my work on a New York Board of Regents’ special panel investigating the high failure rate on the June 2003 New York mathematics graduation test. Our investigation found major deficiencies both in the setting and in the maintaining of performance standards. Many of these problems involve very technical aspects of the psychometric methodology, based on Item Response Theory, for maintaining a constant performance standard over time. My analysis indicated that a proper application of the psychometric methodology would have set a true cutoff for passing the June 2003 Math A test in the low 30s, rather than the required 51 out of 85. Detailed reports can be found at www.ams.sunysb.edu/~tucker.

Here is what Item Response Theory claims to do in a nutshell: it can calculate a consistent passing score on future tests on the basis of the projected performance of a hypothetical student whose probability of solving a certain "bookmark" problem correctly is two-thirds. The bookmark problem is chosen by a panel of experts from a set of problems that a group of students has worked. The bookmark process assumes that experts and student test-takers will order problems in the same way with respect to their level of difficulty. Unfortunately, experts ranked the difficulty of the Math A test problems in quite a different order from the students, as represented by their performance on the problems. The fact that students had been drilled on some "hard" types of problems and had become rusty on some "easy" types accounted for the discrepancy.

The artificially elevated passing score in June 2003 was the result of four categories of psychometric flaws involved in maintaining a constant performance standard over time: (1) improper choice of a key parameter in item response curves used to predict the probabilities of success on test questions, (2) failure of "anchor items" (used in all field tests to equate performance from year to year) to detect improving student skills, (3) poor distribution of question difficulties (the test evolved to have a bimodal distribution of quite easy and quite hard questions with almost no questions close to the intermediate difficulty level of the performance standard), and finally, (4) inconsistent methodologies from different psychometric vendors. Any test with a hint of creative problem-solving (i.e., the problems are not totally predictable) is likely to be affected by many of these psychometric problems, especially if the test is aiming to raise the performance of students over time.
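To make the bookmark procedure concrete, here is a minimal sketch of the standard three-parameter logistic (3PL) item response curve used in Item Response Theory, together with the inversion that converts a bookmark item into an ability cutoff at the conventional two-thirds success probability. The item parameters (discrimination, difficulty, guessing) in the example are hypothetical, not taken from the Math A test; the "key parameter" whose improper choice is flagged above is one of these.

```python
import math

def irt_3pl(theta, a, b, c):
    """Three-parameter logistic item response curve: the probability that
    a student of ability theta answers this item correctly.
    a = discrimination, b = difficulty, c = lower asymptote (guessing)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def bookmark_cutoff(a, b, c, p=2/3):
    """Invert the 3PL curve: the ability level at which the probability of
    success on the bookmark item equals p (requires p > c)."""
    return b - math.log((1 - c) / (p - c) - 1) / a

# Hypothetical bookmark item: a = 1.2, b = 0.5, c = 0.2
theta_cut = bookmark_cutoff(1.2, 0.5, 0.2)
```

Once the ability cutoff `theta_cut` is fixed, the passing score on any future test form is the expected total score of a student at that ability, summed over that form's item curves; errors in the fitted parameters therefore propagate directly into the cutoff.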

**Implications.** It is very difficult to design a reliable standards-based test that can consistently measure, over time, the sort of mathematical reasoning that a high-quality K-12 mathematical education should develop in future citizens. A major component of No Child Left Behind is sanctions on schools at which one or more defined cohorts of students fail to improve suitably over time on standards-based tests. Requiring a school to raise its proficiency rate, say, from 32 percent to 35 percent over one year is statistically meaningless if the margin of error in year-to-year comparisons is at least 5 points. Teachers, school administrators, and the general public need to be aware that the proficiency rates on state tests are probably not comparable over time at a level of precision that would justify high-stakes consequences for failure to make a 10 percent improvement, unless a test consisted of highly predictable questions for which students could be drilled in a rote fashion.

A more detailed article about the Math A test by Alan Tucker can be downloaded at http://www.ams.sunysb.edu/~tucker/StandardsProb.pdf.