Quality of Questions on Common Tests at Issue

By Stephen Sawchuk, January 21, 2010

Most experts in the testing community have presumed that the $350 million promised by the U.S. Department of Education to support common assessments would promote those that made greater use of open-ended items capable of measuring higher-order critical-thinking skills.

But as measurement experts consider the multitude of possibilities for an assessment system based more heavily on such questions, they also are beginning to reflect on practical obstacles to putting such a system into place.

The issues now on the table include the added expense of those items, as well as sensitive questions about who should be charged with the task of scoring them and whether they will prove reliable enough for high-stakes decisions. Also being confronted are matters of governance: the quandary of which entities would actually "own" any new assessments created in common by states and whether working in state consortia would generate savings.

"The reality is that it does cost more to base a system on open-ended items, no question about it," said Scott Marion, the vice president of the Dover, N.H.-based Center for Assessment, a test-consulting group, who is advising several states. "If the model we're thinking about has got to be on-demand and high-stakes and used in systems with scores that are returned quickly, then it's going to cost a lot."

Higher Costs?

State dependence on multiple-choice testing under the federal No Child Left Behind Act has led to a backlash by those who say the tests, while cheap and technically reliable, come at a cost: not measuring complex cognitive skills.

[Figure: Uphill Climb. A push for open-ended assessments comes after states reduced their use in the wake of the No Child Left Behind Act. Source: Government Accountability Office]

Using a slice of money from the $4.35 billion Race to the Top Fund, created last year under the American Recovery and Reinvestment Act, U.S. Secretary of Education Arne Duncan has called for state consortia to craft richer item types aligned to common standards that would include constructed-response questions, extended tasks, and performance-based items in which students would apply knowledge in new ways. ("Stimulus Seeks Enriched Tests," Aug. 12, 2009.)

His department is expected to open up a competition in March for the assessment aid from the stimulus law. Work is under way, meanwhile, on a national project to produce a "common core" of academic standards for adoption by states.

No Redos

Assessment experts caution that open-ended test items carry with them a number of practical challenges. For one, items that measure higher-order skills are generally more expensive to devise, depending on how extensive the items are and how much of the total test such items make up.

For instance, with their detailed prompts and scenarios, questions that require students to engage in extensive writing or to defend their answer choices often are "memorable," meaning the items can't be reused for many years and must be replaced in the meantime.

Wes Bruce, the chief assessment officer for the Indiana education department, recalled one prompt that required 5th graders to write about what would happen if a kangaroo bounded into the classroom.

"All across the state, kids were talking about the prompt," he said. "From an assessment perspective, that's not good. Teachers will use it [subsequently] as an example for classroom work."

The scoring process for open-ended items is also far more complicated than sticking a bunch of test papers into a computer scanner. It relies on "hand scorers" who are trained according to a scoring guide for each question and a set of "anchor papers" that give examples of performance on the item at each level. Each open-ended item typically goes through multiple reviews to ensure consistent scoring.

Depending on the complexity of the item and how long it takes to score, the costs can increase dramatically. A short constructed-response item with four possible scores might take one minute to score, said Stuart R. Kahl, the president of Measured Progress, a nonprofit test contractor based in Dover, N.H. But an extended performance-based or portfolio item might take up to an hour, he said. With test scorers paid in the range of $12 to $15 an hour, such costs would add up.

For a midsize state with about 500,000 students within the tested grades and subjects, the scoring of tests based even partly on constructed-response items would make up more than a fifth of the total annual contract cost, Mr. Kahl estimated.

Scoring Scenarios

For some, the idea of expanding human-scored items raises issues of reliability: Performance-based items are typically less mathematically reliable than those based entirely on multiple choice.

Todd Farley, a 15-year veteran of the test-scoring business who detailed his experiences in a recent book, Making the Grades, is among the skeptics. In the book, Mr. Farley alleges that the scoring guidelines for open-ended items were frequently counterintuitive, and that as a "table leader" (an individual supervising other scorers' work), he occasionally changed other reviewers' scores.

Though test publishers interviewed for this story dismissed Mr. Farley's account, independent sources do point to areas of concern. At least two reports issued by the Education Department's office of inspector general last year, for instance, found lapses in the oversight of test contractors charged with scoring open-ended items.

In part to ameliorate the errors and costs associated with human scoring, test publishers are investing heavily in automated-response systems that use artificial-intelligence software to "read" student answers.

Such programs are already in use to minimize the number of human scorers needed for major tests such as the Graduate Record Examination. A handful of states, including Indiana, have piloted the technology for their own tests, and some experts, like Joan Herman, a director of the National Center for Research on Evaluation, Standards, and Student Testing, or CRESST, a group of assessment researchers headquartered at the University of California, Los Angeles, believe that the technology will be widespread within five years.

"When that happens, it will open up entirely new windows for doing more complex, open-ended items on large-scale assessments," Ms. Herman said.

But such systems are not perfect, and experts including Mr. Bruce, in Indiana, noted that one of their limitations is that they typically don't generate interpretative information about where a student needs to improve.

And in an era in which most teachers are distanced from the assessment process, unions and other stakeholders argue that teachers should have a greater role in the development and scoring of assessments, which can serve an important function for gathering information on student performance.

Teacher scoring of assessments has generally been eschewed under the NCLB law, the current version of the Elementary and Secondary Education Act, in part because of fears that the law's accountability pressure would cause educators to inflate their students' scores. But other countries have tackled the problem by using a blind-scoring process, in which teachers meet in central locations to score exams with names removed.

Hiring substitute teachers and paying the scoring teachers for their time would be costly as well. But there are benefits in the form of professional development, said Marcia Wilbur, the executive director of curriculum and content development for the College Board's Advanced Placement program, which uses such a system to grade open-ended essays on AP exams.

Teachers, Ms. Wilbur said, spend a week looking at sample responses at each level of the scoring guides before they begin to score, and the guides often help them better understand students' learning progressions.

"Teachers will bare their souls about what they do in their classrooms while discussing the samples," she said. "It's not that they go back to their classrooms and teach to the test, but they have a better understanding of the skills and the features of the skills that students struggle with."

Another model, drawn from international practice in places such as Australia, England, Hong Kong, and New Zealand, would rely on teachers' scoring assessments that are embedded within classroom curricular activities. Kentucky used such a system in the days before the 8-year-old NCLB law.

Not only could such assessments provide better feedback on where instruction needs to be improved, but they also could be included in state gauges of achievement if there were a central auditing process to ensure appropriate administration and reliable scoring, CRESST's Ms. Herman said.

Consortia Challenges

Aside from costs, the problem with systems that rely heavily on teacher scoring, Mr. Kahl argued, is that results cannot be scored as quickly or efficiently as those done by machine. That means it could prove tough to turn results around under the quick timeline envisioned by the NCLB law and current state accountability systems.

Both Ms. Herman and Mr. Kahl said it might be possible to build systems that coupled curriculum-embedded assessments, scored throughout the year, with information from more-typical "on demand" standardized tests. But such a model hasn't been tested in the United States.

"While most people agree that some amount of analyzing student work is good professional development, every teacher probably doesn't have to do 100 papers to get the full value of it," Ms. Herman added.

Also uncertain is which entities would actually own test forms or items developed in common, and which would bear the responsibility for updating them. Under the current system, such decisions are now made through contracts with individual states.

A common test, Mr. Marion of the Center for Assessment said, argues for some kind of open-source setup, but the details are "dicey," he concedes. "The reality is that if this is paid for out of stimulus funds, the country should own them," he said, "but at some point, someone has to house these items."

Mr. Kahl, meanwhile, warns that one of the major selling points of state consortia (cheaper tests) might yield only limited savings for some states. Consortia would probably ease costs more for small states, where test development is a high percentage of overall assessment costs, but might not help larger states, where most testing costs involve factors such as printing, scoring, and reporting results.

Such questions of governance, finance, and sustainability have drawn the concern of policy experts, too.

Doubts and Hopes

Chester E. Finn Jr., the president of the Thomas B. Fordham Institute, warned in a recent article for the Washington think tank's newsletter that the federal competition could lock in features of an assessment system that would be difficult to change in the future.

He also worries that the Obama administration's ambitious goals for the assessment funding, which include generating information about both school and student performance as well as data about teacher effectiveness, could prove to be irreconcilable.

"If all the glitterati ... remains in the grant competition, anyone that wants to win the competition is going to have to pretend they can do all those things," Mr. Finn said in an interview. "But since we know that they can't all be done by the same assessment, in the same period of time at a finite price, something will get left in the dust."

But for all those challenges, state testing experts hope to see breakthroughs with the federal funding.

"They are expecting that there be some innovation in the assessment area," Mr. Bruce of Indiana said. "As a state that has been committed to using what at one point were innovative item types, and is still looking at ways to innovate in the scoring, it's exciting."

Coverage of the American Recovery and Reinvestment Act is supported in part by grants from the William and Flora Hewlett Foundation and the Charles Stewart Mott Foundation.
A version of this article appeared in the January 27, 2010 edition of Education Week as "Open-Ended Test Items Pose Challenges."
