Quality of Questions on Common Tests at Issue

Most experts in the testing community have presumed that the $350 million promised by the U.S. Department of Education to support common assessments would promote those that made greater use of open-ended items capable of measuring higher-order critical-thinking skills.

But as measurement experts consider the multitude of possibilities for an assessment system based more heavily on such questions, they also are beginning to reflect on practical obstacles to putting such a system into place.

The issues now on the table include the added expense of those items, as well as sensitive questions about who should be charged with the task of scoring them and whether they will prove reliable enough for high-stakes decisions. Also being confronted are matters of governance: the quandary of which entities would actually "own" any new assessments created in common by states and whether working in state consortia would generate savings.

"The reality is that it does cost more to base a system on open-ended items, no question about it," said Scott Marion, the vice president of the Dover, N.H.-based Center for Assessment, a test-consulting group, who is advising several states. "If the model we're thinking about has got to be on-demand and high-stakes and used in systems with scores that are returned quickly, then it's going to cost a lot."

Higher Costs?

State dependence on multiple-choice testing under the federal No Child Left Behind Act has led to a backlash by those who say the tests, while cheap and technically reliable, come at a cost: not measuring complex cognitive skills.

A push for open-ended assessments comes after states reduced their use in the wake of the No Child Left Behind Act.

Source: Government Accountability Office

Using a slice of money from the $4.35 billion Race to the Top Fund, created last year under the American Recovery and Reinvestment Act, U.S. Secretary of Education Arne Duncan has called for state consortia to craft richer item types aligned to common standards that would include constructed-response questions, extended tasks, and performance-based items in which students would apply knowledge in new ways. ("Stimulus Seeks Enriched Tests," Aug. 12, 2009.)

His department is expected to open up a competition in March for the assessment aid from the stimulus law. Work is under way, meanwhile, on a national project to produce a "common core" of academic standards for adoption by states.

No Redos

Assessment experts caution that open-ended test items carry with them a number of practical challenges. For one, items that measure higher-order skills are generally more expensive to devise, depending on how extensive the items are and how much of the total test such items make up.

For instance, with their detailed prompts and scenarios, questions that require students to engage in extensive writing or to defend their answer choices often are "memorable," meaning the items can't be reused for many years and must be replaced in the meantime.

Wes Bruce, the chief assessment officer for the Indiana education department, recalled one prompt that required 5th graders to write about what would happen if a kangaroo bounded into the classroom.

"All across the state, kids were talking about the prompt," he said. "From an assessment perspective, that's not good. Teachers will use it [subsequently] as an example for classroom work."

The scoring process for open-ended items is also far more complicated than sticking a bunch of test papers into a computer scanner. It relies on "hand scorers" who are trained according to a scoring guide for each question and a set of "anchor papers" that give examples of performance on the item at each level. Each open-ended item typically goes through multiple reviews to ensure consistent scoring.

Depending on the complexity of the item and how long it takes to score, the costs can increase dramatically. A short constructed-response item with four possible scores might take one minute to score, said Stuart R. Kahl, the president of Measured Progress, a nonprofit test contractor based in Dover, N.H. But an extended performance-based or portfolio item might take up to an hour, he said. With test scorers paid in the range of $12 to $15 an hour, such costs would add up.

For a midsize state with about 500,000 students within the tested grades and subjects, the scoring of tests based even partly on constructed-response items would make up more than a fifth of the total annual contract cost, Mr. Kahl estimated.
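To give a rough sense of scale for the figures above, the following is a minimal back-of-the-envelope sketch, not an estimate from Mr. Kahl or any test contractor. It assumes, purely for illustration, that each of the 500,000 students submits one hand-scored response and that each response is read once; the function name `scoring_cost` and the `reads_per_response` parameter are hypothetical. It counts scorer wages only and ignores training, supervision, and other contract costs.

```python
def scoring_cost(responses, minutes_per_response, hourly_wage, reads_per_response=1):
    """Scorer wages, in dollars, for hand-scoring a batch of responses.

    Illustrative only: covers wages, not training, supervision, or overhead.
    """
    total_minutes = responses * reads_per_response * minutes_per_response
    return total_minutes * hourly_wage / 60


STUDENTS = 500_000  # midsize-state figure cited in the article

# Scoring times cited in the article: about one minute for a short
# constructed-response item, up to an hour for an extended task.
# Assumption: one response per student, one read per response.
for label, minutes in [("short constructed response", 1), ("extended task", 60)]:
    low = scoring_cost(STUDENTS, minutes, hourly_wage=12)
    high = scoring_cost(STUDENTS, minutes, hourly_wage=15)
    print(f"{label}: ${low:,.0f} to ${high:,.0f}")
```

Under these assumptions, a single one-minute item comes to roughly $100,000 to $125,000 in scorer wages statewide, while a single hour-long task comes to roughly $6 million to $7.5 million. Raising `reads_per_response` to reflect the multiple reviews described earlier scales both figures proportionally.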

Scoring Scenarios

For some, the idea of expanding human-scored items raises issues of reliability: Performance-based items are typically less mathematically reliable than those based entirely on multiple choice.

Todd Farley, a 15-year veteran of the test-scoring business who detailed his experiences in a recent book, Making the Grades, is among the skeptics. In the book, Mr. Farley alleges that the scoring guidelines for open-ended items were frequently counterintuitive, and that as a "table leader" (an individual supervising other scorers' work), he occasionally changed other reviewers' scores.

Though test publishers interviewed for this story dismissed Mr. Farley's account, independent sources do point to areas of concern. At least two reports issued by the Education Department's office of inspector general last year, for instance, found lapses in the oversight of test contractors charged with scoring open-ended items.

In part to ameliorate the errors and costs associated with human scoring, test publishers are investing heavily in automated-response systems that use artificial-intelligence software to "read" student answers.

Such programs are already in use to minimize the number of human scorers needed for major tests such as the Graduate Record Examination. A handful of states, including Indiana, have piloted the technology for their own tests, and some experts, like Joan Herman, a director of the National Center for Research on Evaluation, Standards, and Student Testing, or CRESST, a group of assessment researchers headquartered at the University of California, Los Angeles, believe that the technology will be widespread within five years.

"When that happens, it will open up entirely new windows for doing more complex, open-ended items on large-scale assessments," Ms. Herman said.

But such systems are not perfect, and experts including Mr. Bruce, in Indiana, noted that one of their limitations is that they typically don't generate interpretative information about where a student needs to improve.

And in an era in which most teachers are distanced from the assessment process, unions and other stakeholders argue that teachers should have a greater role in the development and scoring of assessments, which can serve an important function for gathering information on student performance.

Teacher scoring of assessments has generally been eschewed under the NCLB law, the current version of the Elementary and Secondary Education Act, in part because of fears that the law's accountability pressure would cause educators to inflate their students' scores. But other countries have tackled the problem by using a blind-scoring process, in which teachers meet in central locations to score exams with names removed.

Hiring substitute teachers and paying the scoring teachers for their time would be costly as well. But there are benefits in the form of professional development, said Marcia Wilbur, the executive director of curriculum and content development for the College Board's Advanced Placement program, which uses such a system to grade open-ended essays on AP exams.

Teachers, Ms. Wilbur said, spend a week looking at sample responses at each level of the scoring guides before they begin to score, and the guides often help them better understand students' learning progressions.

"Teachers will bare their souls about what they do in their classrooms while discussing the samples," she said. "It's not that they go back to their classrooms and teach to the test, but they have a better understanding of the skills and the features of the skills that students struggle with."

Another model, drawn from international practice in places such as Australia, England, Hong Kong, and New Zealand, would rely on teachers' scoring assessments that are embedded within classroom curricular activities. Kentucky used such a system in the days before the 8-year-old NCLB law.

Not only could such assessments provide better feedback on where instruction needs to be improved, but they also could be included in state gauges of achievement if there were a central auditing process to ensure appropriate administration and reliable scoring, CRESST's Ms. Herman said.

Consortia Challenges

Aside from costs, the problem with systems that rely heavily on teacher scoring, Mr. Kahl argued, is that results cannot be scored as quickly or efficiently as those done by machine. That means it could prove tough to turn results around under the quick timeline envisioned by the NCLB law and current state accountability systems.

Both Ms. Herman and Mr. Kahl said it might be possible to build systems that coupled curriculum-embedded assessments, scored throughout the year, with information from more-typical "on demand" standardized tests. But such a model hasn't been tested in the United States.

"While most people agree that some amount of analyzing student work is good professional development, every teacher probably doesn't have to do 100 papers to get the full value of it," Ms. Herman added.

Also uncertain is which entities would actually own test forms or items developed in common, and which would bear the responsibility for updating them. Under the current system, such decisions are now made through contracts with individual states.

A common test, Mr. Marion of the Center for Assessment said, argues for some kind of open-source setup, but the details are "dicey," he concedes. "The reality is that if this is paid for out of stimulus funds, the country should own them," he said, "but at some point, someone has to house these items."

Mr. Kahl, meanwhile, warns that one of the major selling points of state consortia, cheaper tests, might yield only limited savings for some states. Consortia would probably ease costs more for small states, where test development is a high percentage of overall assessment costs, but might not help larger states, where most testing costs involve factors such as printing, scoring, and reporting results.

Such questions of governance, finance, and sustainability have drawn the concern of policy experts, too.

Doubts and Hopes

Chester E. Finn Jr., the president of the Thomas B. Fordham Institute, warned in a recent article for the Washington think tank's newsletter that the federal competition could lock in features of an assessment system that would be difficult to change in the future.

He also worries that the Obama administration's ambitious goals for the assessment funding (which include generating information about both school and student performance as well as data about teacher effectiveness) could prove to be irreconcilable.

"If all the glitterati ... remains in the grant competition, anyone that wants to win the competition is going to have to pretend they can do all those things," Mr. Finn said in an interview. "But since we know that they can't all be done by the same assessment, in the same period of time at a finite price, something will get left in the dust."

But for all those challenges, state testing experts hope to see breakthroughs with the federal funding.

"They are expecting that there be some innovation in the assessment area," Mr. Bruce of Indiana said. "As a state that has been committed to using what at one point were innovative item types, and is still looking at ways to innovate in the scoring, it's exciting."

Stephen Sawchuk

Assistant Managing Editor, Education Week

Stephen Sawchuk is an assistant managing editor for Education Week, leading coverage of teaching, learning, and curriculum.

Coverage of the American Recovery and Reinvestment Act is supported in part by grants from the William and Flora Hewlett Foundation and the Charles Stewart Mott Foundation.

A version of this article appeared in the January 27, 2010, edition of Education Week as "Open-Ended Test Items Pose Challenges."