## Some Common Errors in Interpreting Test Scores

By Gerald W. Bracey (Carol Fry Bohlin, Column Editor)
(News Bulletin, April 2004)

In these data-driven times, educators are confronted with an enormous number of statistics related to test scores and educational research. A discussion of a few common misconceptions follows:

1. Thinking that all students can be at or above "grade level."
This idea is often referred to as the "Lake Wobegon" effect, named for Garrison Keillor's mythical Minnesota town where "all the women are strong, all the men are good looking, and all the children are above average." Well, it's possible for a town, but not a nation, to have only children who are above average. That's because of the way "grade level" is generally defined. A publisher who is developing a test administers it to perhaps 200,000 students, selected to be demographically representative of the nation as a whole. This group of students is called the national norming sample. Its median score is called the national norm. At any particular grade, "grade level" is defined as the score of the student who gets the average (median) score for that grade. Thus, by definition, half of the nation's students are below grade level at any particular moment.
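The arithmetic behind this definition can be sketched in a few lines. The scores below are made up for illustration; the point is only that "grade level," defined as the median of a norming sample, guarantees that about half of any representative group falls below it:

```python
import statistics

# Hypothetical norming-sample scores (any representative sample behaves the same way)
norming_sample = [12, 15, 18, 20, 22, 25, 27, 30, 33, 36, 40]

# "Grade level" is defined as the median score of the norming sample
grade_level = statistics.median(norming_sample)

# By definition, roughly half the sample scores below that median
below = sum(1 for s in norming_sample if s < grade_level)
print(grade_level)                   # 25
print(below, "of", len(norming_sample), "below grade level")  # 5 of 11
```

No matter how much every student improves, re-norming the test moves the median up with them, so half the norming population is always "below grade level."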

This definition often eludes the media and politicians, who report entirely expectable results as supposedly scandalous. In one such "scandal," 25 percent of the graduating seniors were not at twelfth-grade level. The reporter for the story was particularly disturbed because the district in question was an affluent one with a good reputation. But if half of the nation's students are by definition below grade level, having only 25 percent in that condition is pretty good.

2. Confusing ranks with scores.
If one arranges a set of numbers in ascending or descending order, the median is the point that divides the set in half. The median is also referred to as the 50th percentile rank. Note that a percentile is a rank, not a score. If your students' average score is at the 75th percentile, you know that, on average, they scored better on that test than 75 percent of the people in the national norming sample. You can't tell from ranks what their scores were or how well they did in any absolute sense. If they're eighth-grade students taking the SAT, they did very well indeed. If they're eighth-grade students taking a fourth-grade test, they didn't do so well.
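The rank-versus-score distinction is easy to see computationally. A minimal sketch, using a hypothetical norming sample of scores 1 through 100, shows that a percentile rank reports only the share of the sample scoring lower, not the score itself:

```python
def percentile_rank(score, norming_sample):
    """Percent of the norming sample scoring below the given score."""
    below = sum(1 for s in norming_sample if s < score)
    return 100.0 * below / len(norming_sample)

# Hypothetical norming sample: one student at each score from 1 to 100
norming_sample = list(range(1, 101))

print(percentile_rank(76, norming_sample))  # 75.0 — a rank, not a score
```

The same 75th-percentile rank could correspond to a raw score of 76, or 760, or 7.6, depending entirely on the test and the sample behind it.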

The confusion between ranks and scores takes on special importance when we are comparing the ranks of schools, districts, or nations. Ranks often make small differences in scores seem large. For example, on the 1995 TIMSS eighth-grade science test, U.S. students got 58 percent of the items correct. That was two points above the average (mean) score of all 41 nations in the study, ranking the U.S. 19th of 41. From this ranking alone, you cannot see that the actual scores were tightly bunched. In fact, if American students had managed to answer a mere 5 percent more of the items correct, they would have been ranked fifth in the world. If they had answered a mere five percent fewer of the items correctly, they would have fallen all the way to 29th.
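A small simulation makes the bunching effect concrete. The 41 "national" scores below are invented, deliberately packed into a 10-point band, to show how a modest change in percent correct produces a dramatic change in rank:

```python
# Hypothetical percent-correct scores for 41 nations, tightly bunched (50.0 to 60.0)
scores = [50 + 0.25 * i for i in range(41)]

def rank_of(score, scores):
    """Rank 1 = highest; counts how many nations scored strictly higher."""
    return 1 + sum(1 for s in scores if s > score)

print(rank_of(56, scores))      # 17 — mid-pack
print(rank_of(56 + 5, scores))  # 1 — five more points jumps to the top
```

When scores are tightly clustered, ranks exaggerate trivial differences; a league table alone cannot tell you whether the gaps between neighbors are large or negligible.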

3. Using the wrong unit of analysis.
In 2003, a study found that high school students in charter schools made greater academic gains than their counterparts in traditional public schools. Another study of the same data found that charter students made smaller gains than those in traditional schools. How could such a contradiction occur?

Let's first look at an analogy. The average SAT score for New Jersey in 2002 was 498. In Mississippi, it was 559. If we consider the state as the unit of analysis, we get an average SAT score of 529 [(498 + 559)/2] for New Jersey and Mississippi. But it's a nonsensical average because of the considerable difference in the number of students taking the SAT in these two states. Only 4 percent (1,213) of the students in Mississippi took the SAT, while almost 80 percent (71,163) of the students in New Jersey huddled in angst on Saturday mornings that year. But using the state as the unit of analysis lets Mississippi's 1,213 carry as much weight as New Jersey's 71,163.
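Carrying through the arithmetic with the figures given in the article shows how far apart the two units of analysis land:

```python
# Figures from the article: 2002 SAT averages and numbers of test takers
nj_mean, nj_n = 498, 71_163   # New Jersey
ms_mean, ms_n = 559, 1_213    # Mississippi

# State as the unit: each state counts once, regardless of size
state_as_unit = (nj_mean + ms_mean) / 2

# Student as the unit: each test taker counts once
student_as_unit = (nj_mean * nj_n + ms_mean * ms_n) / (nj_n + ms_n)

print(state_as_unit)           # 528.5 (the article's 529)
print(round(student_as_unit))  # 499 — dominated by New Jersey's 71,163 takers
```

Weighting by the number of test takers pulls the combined average down to roughly New Jersey's, because New Jersey supplies about 98 percent of the students in the pool.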

In the charter school studies, one researcher used the high school as the unit of analysis; the other used the student. The student is the proper unit of analysis because using the school as the unit creates the same kind of problem as using the state as the unit for calculating SAT averages. High schools vary in size, and charter high schools are typically much smaller than traditional schools. Using the school as the unit of analysis gives the smaller charter schools the same weight as the larger traditional schools. Using the student as the unit gives large schools more weight, just as it gives states with large numbers of test takers more weight in figuring an SAT average. The proper conclusion is that the academic gains of traditional public school students were somewhat greater over a four-year period than those of students in charter schools.

In your district, if a district average is calculated using the school as the unit of analysis and if the schools in your district differ in size, school comparisons will not be accurate.
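The same check works at the district level. The two schools below are hypothetical, but the pattern (a small school with high scores and a large school with lower ones) is exactly what distorts a school-as-unit average:

```python
# Hypothetical district: one small high-scoring school, one large lower-scoring school
schools = [
    {"name": "Elm", "n": 120,   "mean": 82.0},
    {"name": "Oak", "n": 1_080, "mean": 70.0},
]

school_as_unit = sum(s["mean"] for s in schools) / len(schools)
student_as_unit = (sum(s["mean"] * s["n"] for s in schools)
                   / sum(s["n"] for s in schools))

print(school_as_unit)   # 76.0 — the 120-student school counts as much as the 1,080
print(student_as_unit)  # 71.2 — each student counts once
```

Whenever school sizes differ, a district average computed school-by-school will drift toward the small schools' scores; weighting by enrollment removes the distortion.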

References

Raymond, Margaret E. The Performance of California Charter Schools. Palo Alto, Calif.: Hoover Institution, May 2003. credo.stanford.edu/Performance%20of%20California%20Charter%20School.FINAL.complete.pdf.

Rogosa, David R. Student Progress in California Charter Schools, 1999-2002. Stanford, Calif.: Stanford University, June 2003. www-stat.stanford.edu/~rag/api/charter9902.pdf.

The following books by Gerald W. Bracey provide additional information related to this topic:

• Bail Me Out! Handling Difficult Data and Tough