By: Stuart Kahl
Despite educational testing officials’ efforts to produce user-friendly reports of student performance on their tests, all too often they find their “consumers” misunderstanding or misinterpreting results. My purpose in writing this blog post is to offer explanations of scaled scores and performance levels that may be shared widely to help address this problem. A longer paper, to which I provide a link at the end, offers more detailed explanations.
A test score is only meaningful in a relative sense, that is, relative to the performance of a referent group, to a predetermined standard of performance, or to previous performance. Thus, to say Maxine earned 36 points on a test tells us nothing. The same is true for the percentage of total possible points Maxine earned. From these “raw scores” alone, without additional information, we don’t know how she performed relative to other students in her school or state, whether her score met or exceeded some threshold for “passing,” or whether her score in some way reflected improved performance. We don’t know if the test was hard or easy, and we have no sense from the score alone of what Maxine knows and is able to do. To address these shortcomings of raw scores, testing companies undertake three important tasks in preparation for the reporting of results: scaling, equating, and standard setting.
Scaling
“Scaling scores” simply means transforming raw scores from their own unique scale to a standard one. Although testing folks might use some sophisticated techniques to do this, it is not unlike converting a set of temperatures from degrees Centigrade to degrees Fahrenheit. We have all seen the simple linear equation that accomplishes this: F = (9/5)C + 32. It maps 0°C to 32°F and 100°C to 212°F, and it preserves the ordering of any set of temperatures.
In the context of testing, suppose raw scores on the third-grade test Maxine took could range from 0 to 60 points. The testing company could transform the scores on that test so that the scaled scores fall between 300 and 399. (The company could make fourth graders’ scaled scores on their test fall in the 400s, the fifth graders’ scores in the 500s, and so on. And the same score ranges could be used for tests in different subjects.) Such transformations would still preserve the ordering and relative distances between scores. For example, if Maxine’s raw score was just slightly higher than Maria’s but a lot higher than Michael’s, the same would be true for their scaled scores.
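To make this concrete, here is a minimal sketch of such a transformation, assuming a simple linear map from the 0–60 raw-score range to the 300–399 reporting scale. (Operational programs typically derive scaled scores through more sophisticated, IRT-based methods, and the raw scores for Maxine, Maria, and Michael below are hypothetical.)

```python
def scale_score(raw, raw_min=0, raw_max=60, scaled_min=300, scaled_max=399):
    """Linearly map a raw score onto a reporting scale.

    A simplified illustration only; operational testing programs
    typically use item response theory (IRT) based scaling.
    """
    fraction = (raw - raw_min) / (raw_max - raw_min)
    return round(scaled_min + fraction * (scaled_max - scaled_min))

# Hypothetical raw scores: Maxine 36, Maria 34, Michael 20.
maxine, maria, michael = scale_score(36), scale_score(34), scale_score(20)
# The transformation preserves ordering: Maxine's scaled score is still
# slightly above Maria's and well above Michael's.
```

Because the map is linear, any statement that was true of the ordering and relative spacing of the raw scores remains true of the scaled scores.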
Equating
Sometimes there are multiple forms of a test in a subject at a grade. Different test forms can be used in the same year or in successive years. Taking the latter situation as an example, consider math test forms administered statewide in the third grade in successive years. While efforts can be made to produce two equivalent tests, they are still likely to vary somewhat in difficulty. If the two forms are scaled independently in their respective years, so that each yields the same scaled mean, we have a problem: if the second year’s test is tougher, a particular score the first year would not have the same meaning as the same score the second year in terms of a student’s capability.
Consider state averages. If statewide performance has genuinely improved, whether because of effective efforts to improve instruction or greater teacher familiarity with the curriculum standards (instructional objectives), the second year’s state mean will not reflect that improvement accurately unless test difficulty is taken into account. Thus, in producing scaled scores, different test forms have to be equated. Again, while the testing experts may use some sophisticated techniques for equating, the ultimate effect of the process is equivalent to adjusting scores for test difficulty. With improved student performance statewide, the second year’s average scaled score should be higher than the previous year’s.
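As a rough illustration of that adjustment, here is a sketch of mean-sigma linear equating, a classical method that places scores from a new form onto a reference form's scale using the means and standard deviations observed for equivalent groups of examinees. The numbers are hypothetical, and operational programs typically use IRT-based equating rather than this simple method.

```python
def mean_sigma_equate(score, m_ref, s_ref, m_new, s_new):
    """Place a new-form score onto the reference form's scale.

    Classical mean-sigma linear equating: standardize the score in
    new-form units, then re-express it in reference-form units.
    """
    return m_ref + s_ref * (score - m_new) / s_new

# Hypothetical equivalent-groups data: the second-year form is tougher,
# so its raw-score mean is lower (36 vs. 40) with the same spread (8).
year1_equivalent = mean_sigma_equate(36, m_ref=40, s_ref=8, m_new=36, s_new=8)
# A raw 36 on the harder second-year form carries the same meaning as
# a raw 40 on the first-year form.
```

The key point is that the same raw score maps to a higher equated score when the form is harder, which is exactly the difficulty adjustment described above.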
Standard Setting and Performance Levels
Scaled scores have some shortcomings. For one, a school’s average scaled score does not convey information about the distribution of scores within a school. Two schools could have the same average scaled score, but one could have many more students near the bottom of the score range, for example. Additionally, scaled scores do not communicate anything about what students know and can do. These are two reasons for the use of performance levels in the reporting of results for state tests and commercial testing programs.
Many states identify four performance levels corresponding to four distinct intervals in the student scaled score range. Three “cut scores” have to be identified to divide students into the four groups based on their scores. This is accomplished through a process called standard setting, which involves panels of educators and non-educators making judgments about student work or test item requirements. The judges match the student work or items to previously developed performance level definitions, which describe the general capabilities of students at the various levels. The aggregation of the judges’ decisions leads directly to cut scores. Because tests are equated across years, standard setting only needs to be done in the first year of a testing program; the cut scores remain the same after that.
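Once the three cut scores are set, assigning a student to a level is a simple lookup. A sketch with hypothetical cut scores and generic level labels, treating each cut score as the minimum scaled score for its level:

```python
from bisect import bisect_right

CUTS = [330, 350, 370]  # hypothetical cut scores from standard setting
LABELS = ["Below Basic", "Basic", "Proficient", "Advanced"]  # generic labels

def performance_level(scaled_score, cuts=CUTS, labels=LABELS):
    """Return the label of the interval containing the scaled score.

    Each cut score is treated as the minimum score of its level, so a
    score exactly at a cut falls into the higher level.
    """
    return labels[bisect_right(cuts, scaled_score)]

# A one-point difference can change the reported level:
# performance_level(329) -> "Below Basic", performance_level(330) -> "Basic"
```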
The graphic below depicts the relationship between scaled scores and performance levels. It shows a scaled score distribution with a mean of 350 and a standard deviation of 25. It also shows three hypothetical cut scores and the percentages of student scores in the four performance levels. (By the way, instead of transforming raw scores to another scale with a “nice” mean, psychometricians often transform scores to a scale that has “nice” numbers for two of the cut points, e.g., 340 or 360.) I show a normal distribution below because state test scores would be approximately normally distributed. Some people believe that, because of an emphasis on performance level reporting, we are getting away from the normal distribution. Nothing could be further from the truth. A test that is unusually easy or hard would show a skewed distribution of scores (the hump shifted toward the high or low end of the scale), but otherwise test scores would be, and should be, approximately normally distributed.
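The percentages in such a graphic follow directly from the normal model. Here is a sketch using Python's statistics.NormalDist with the stated mean of 350 and standard deviation of 25, and hypothetical cut scores of 330, 350, and 370; the actual cut scores and percentages in any real program would differ.

```python
from statistics import NormalDist

dist = NormalDist(mu=350, sigma=25)   # mean and SD from the graphic
cuts = [330, 350, 370]                # hypothetical cut scores
labels = ["Below Basic", "Basic", "Proficient", "Advanced"]

# Each level's share is the area under the curve between its bounds.
bounds = [float("-inf")] + cuts + [float("inf")]
for label, lo, hi in zip(labels, bounds, bounds[1:]):
    pct = (dist.cdf(hi) - dist.cdf(lo)) * 100
    print(f"{label}: {pct:.1f}%")
```

With these particular numbers the four levels would capture roughly 21%, 29%, 29%, and 21% of students, respectively.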
In some ways, information is lost when we move from scaled scores to performance levels. As the graphic shows, two students scoring at the opposite extremes of the same performance level, say “Basic,” would have performed very differently on the test, yet would be assigned to the same level with the same general description of capabilities. Likewise, two students scoring close to the same cut score, but on opposite sides of it, would be put in different performance levels with different descriptions of capabilities. This is why testing programs usually report both scaled scores and performance levels in parent reports. It is entirely appropriate for state accountability testing to focus on the percentage of students in a school who are designated “proficient or above.” But a parent’s statement, “My child is proficient in math, but basic in reading,” could be misleading. Similarly, some advocates of competency-based approaches who believe or behave as if students are either competent or not need to recognize that there is a broad continuum of performance quality associated with complex, higher order tasks, and that evidence from a wide variety of such tasks is needed to gauge competence.
Interested in learning more?
Please refer to Scaled Scores and Performance Levels: The Nuts and Bolts Behind Them and Issues with Their Use for additional information.