Feature Articles

Using Student Test Scores to Measure Teacher Performance: Some Problems in the Design and Implementation of Evaluation Systems

Dale Ballou and Matthew G. Springer
Vanderbilt University, Nashville, TN

Educational Researcher, Vol. 44 No. 2, pp. 77–86. DOI: 10.3102/0013189X15574904. © 2015 AERA.

Our aim in this article is to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. We focus on four: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification and the supervision of exams.

Keywords: accountability; educational policy; policy analysis; regression analyses; statistics; teacher research

Introduction

Race to the Top (RTTT) is a competitive grant program created under the American Recovery and Reinvestment Act of 2009. RTTT provides incentives for states to reform K–12 education in such areas as turning around low-performing schools and improving teacher and principal effectiveness. To date, the U.S. Department of Education (USDOE) has awarded 19 states more than $4.35 billion to implement RTTT reforms. These states serve approximately 22 million students and employ 1.9 million teachers in 42,000 schools, representing roughly 45 percent of all K–12 students and 42 percent of all low-income students (USDOE, 2014). As part of RTTT, USDOE called for states and their participating school districts to improve teacher and principal effectiveness by developing comprehensive educator evaluation systems. State and district educator evaluation system plans were reviewed by USDOE to ensure districts (1) measure student growth for each individual student; (2) design and implement evaluation systems that include multiple rating categories that take into account data on student growth as a significant factor; (3) evaluate teachers and principals annually and provide feedback, including student growth data; and (4) use these evaluations to inform decisions regarding professional development, compensation, promotion, retention, tenure, and certification (USDOE, 2009). The development of educator evaluation systems in which one component is student performance on standardized tests is unpopular with some teachers and controversial among
statisticians. American Federation of Teachers President Randi Weingarten has called for an end to using value-added measures as a component of teacher evaluation systems (Sawchuk, 2014). Much has been said and written about the difficulty of drawing valid statistical inferences about teacher quality from student test scores (American Statistical Association, 2014; Harris, 2011; McCaffrey, Lockwood, Koretz, & Hamilton, 2003). It is not our intent to repeat those arguments here. We suspect that such reforms are here to stay and that test-based measures of teacher performance will be incorporated into teacher evaluation systems with increasing frequency. On the whole we regard the use of educator evaluation systems as a positive development, provided judicious use is made of this information. No evaluation instrument is perfect; every evaluation system is an assembly of various imperfect measures. There is information in student test scores about teacher performance; the challenge is to extract it and combine it with the information gleaned from other instruments. Our aim in this article is to draw attention to some underappreciated problems in the design and implementation of evaluation systems that incorporate value-added measures. We focus on four: (1) taking into account measurement error in teacher assessments, (2) revising teachers’ scores as more information becomes available about their students, and (3) and (4) minimizing opportunistic behavior by teachers during roster verification
and the supervision of exams. In conclusion, we offer some practical guidance and recommendations.

Taking Account of Estimation Error

Teacher value-added estimates are notoriously imprecise. If value-added scores are to be used for high-stakes personnel decisions, appropriate account must be taken of the magnitude of the likely error in these estimates. Otherwise decisions based on them will be unfair to teachers. It is important to note that it is not our purpose in this article to offer a general critique of the treatment of estimation error in value-added assessment. Rather, we focus on two problems that we find in teacher evaluation systems being implemented in RTTT states: (1) systems that ignore estimation error altogether and (2) systems that rely on t-statistics as a summary measure of teacher performance.

Ignoring Estimation Error

Evaluation systems based on the Colorado Growth Model (Betebenner, 2008, 2011) or a close variant are particularly likely to ignore estimation error. An example is the Georgia teacher evaluation system (Georgia Department of Education, 2013). Student performance on standardized tests is compared to that of students who scored at the same percentile in the prior year. Where students fall in this year's distribution (of identically scoring students the previous year) is used to gauge teacher effectiveness. If the average percentile across a teacher's students falls below 30, the teacher is deemed ineffective with respect to this component of the evaluation system. The probability that this happens to a particular teacher is a function of three things: the teacher's true effectiveness, the variability of student performance for reasons other than teacher quality (including test measurement error), and the number of students a teacher has. We have presented some illustrative calculations in the appendix to this article, comparing a teacher of 25 students to a teacher of 100 students. (The first might be an elementary teacher with a self-contained classroom, whereas the second teaches in a middle school using departmentalized instruction.) In our illustrative examples the former is 4 to 12 times more likely to be deemed ineffective, solely as a function of the number of the teacher's students who are tested—a reflection of the fact that the measures used in such accountability systems are noisy and that the amount of noise is greater the fewer students a teacher has.1 Clearly it is unfair to treat two teachers with the same true effectiveness differently. Equally troubling, resources will be wasted if teachers are targeted for interventions without taking into account the probability that the ratings they receive are based on error.

Incorporating Measurement Error in a t-Statistic

One of the most widely used methods for teacher value-added assessment is the SAS Institute's Educational Value-Added Assessment System (EVAAS). SAS provides statewide EVAAS reporting to every district, public school, and charter school in North Carolina, Ohio, Pennsylvania, and Tennessee.

SAS EVAAS for K–12 reporting is also used by large, medium, and small districts and schools in many other states, including Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Georgia, Illinois, Louisiana, Missouri, New Jersey, New York, South Carolina, Texas, Virginia, and Wyoming (SAS Institute, 2014). EVAAS incorporates measurement error into a single summary measure of performance in the form of a t-statistic, the ratio of the teacher’s value-added score to its standard error. This statistic or a transformation of it is then incorporated into the teacher’s evaluation. An example is the North Carolina teacher evaluation system, which classifies teachers into five categories based on the ratio of the teacher’s value-added score to its standard error (the teacher’s t-statistic). The numerator of this statistic is an estimate of how much the teacher in question differs from average. The denominator measures how precisely we have estimated the numerator. Clearly both are important. A large denominator relative to the numerator means we cannot be very confident that the teacher in question truly differs from the average: the value-added estimate is too imprecise. The North Carolina system classifies teachers into five categories based on this ratio: greater than 2, between 1 and 2, between 1 and −1, between −1 and −2, and less than −2. Teachers consistently in the bottom category may face sanctions. This use of a t-statistic is misguided. This statistic is conventionally used to test the hypothesis that a teacher differs significantly from average.2 An affirmative answer does not mean that the teacher in question is so bad that some kind of corrective action should be taken. It simply means that decision makers have enough information to rule out, with a high degree of confidence, the hypothesis that the teacher is average. A teacher who is just a little bit worse than average, but whose value added is estimated with a high degree of precision, can have a t-statistic less than −2. The teachers at greatest risk of this are those for whom we have a lot of data. It is not the high value of the numerator that puts the teacher in the lowest category but the small value of the denominator. Yet the goal in evaluation ought not to be to identify teachers who are worse than average, even if only by a small amount, but to identify teachers who are quite a bit worse than average, provided decision makers have sufficient confidence in that estimate. The t-statistic is the wrong tool for this job. A more reasonable approach would be to apply a two-step test to each teacher. The first step would focus on the numerator of the t-statistic, identifying those teachers whose estimated value added falls below (or above) some agreed-on threshold. If so, then one would proceed to the second step, asking whether the estimate is sufficiently precise for this teacher to be identified as a case requiring further attention or action.3 It is far from obvious that the answer to this second question should rely on a conventional standard for statistical significance (e.g., that the standard error be no greater than half the value-added estimate). The decision ought to be based on the relative costs of false positives and false negatives, not on the answer to the purely conventional question, Are they 95% confident the teacher differs from average? But even if this standard were used in the second step, such a two-step procedure would be an improvement over present practice.
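To make the contrast concrete, the sketch below implements both rules for a single teacher. It is only an illustration: the category labels, the effect threshold, and the precision requirement are hypothetical values chosen for the example, not parameters of the North Carolina system or of any state's plan.

```python
# Sketch: classifying a teacher's value-added estimate in two ways.
# The first function is patterned on the five t-statistic categories described above;
# the second implements the proposed two-step screen: ask whether the point estimate
# is low enough to matter, then whether it is estimated precisely enough to act on.
# All cutoffs are illustrative placeholders.

def t_statistic_category(estimate: float, std_error: float) -> str:
    """Five categories based solely on the ratio of estimate to standard error."""
    t = estimate / std_error
    if t > 2:
        return "well above average"
    if t > 1:
        return "above average"
    if t >= -1:
        return "average"
    if t >= -2:
        return "below average"
    return "well below average"

def two_step_flag(estimate: float, std_error: float,
                  effect_threshold: float = -5.0,
                  max_se: float = 2.5) -> bool:
    """Flag a teacher only if the estimate is substantively low AND precise enough."""
    return estimate < effect_threshold and std_error < max_se

# A teacher only slightly below average but very precisely estimated:
print(t_statistic_category(-1.0, 0.4))   # "well below average" (t = -2.5)
print(two_step_flag(-1.0, 0.4))          # False: the shortfall is too small to matter
```

In this example, a teacher who is barely below average but precisely measured lands in the bottom t-statistic category yet is not flagged by the two-step screen, which is exactly the distinction argued for above.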


Revising Value-Added Estimates

SAS EVAAS reassesses the performance of a teacher as more data become available about students that teacher has taught in the past. Thus, a fourth-grade teacher who receives a value-added score for the cohort he or she had in 2007–2008 will receive a revised score for that same cohort a year later in 2008–2009, taking into account subsequent performance of those students (basically, whether their progress is sustained). He or she will receive yet another revision in the 2009–2010 school year for the same cohort.4 This has confused teachers, who wonder why their value-added score keeps changing for students they had in the past. Whether or not there are sound statistical reasons for undertaking these revisions (a question to which we return below), revising value-added estimates poses problems when the evaluation system is used for high-stakes decisions. What will be done about the teacher whose performance during the 2013–2014 school year, as calculated in the summer of 2014, was so low that the teacher loses his or her job or license but whose revised estimate for the same year, released in the summer of 2015, places the teacher's performance above the threshold at which these sanctions would apply? States contemplating the use of EVAAS scores for high-stakes personnel decisions may attempt to evade this problem by using only the first set of estimates produced for a given cohort, ignoring subsequent revisions. This circumvents the problem only in the narrow sense: If you never look at the revised score, no high-stakes decision will be called into question on the basis of later data. But if it is statistically sound practice to make these revisions, as the SAS Institute has always claimed, what is the response to teachers who claim that their revised scores would show they ought not to have lost their jobs? What is to prevent a discharged teacher from going to court and demanding they be produced? The state may request that the SAS Institute no longer furnish revised estimates, but it is not clear that even this would suffice, for such a teacher could demand they be produced, as they always were in the past. The issue becomes still messier when we consider whether the revised scores are actually better. Notwithstanding the oft-repeated claim that EVAAS uses a student's entire history of test scores when estimating teacher value added, these calculations in fact have been based on a 5-year window of data—that is, the most recent 5 school years furnish the data used to produce the estimates (Ballou, Sanders, & Wright, 2004). With each passing year, the oldest data are dropped, and test results from the most recent year are added. This affects teachers quite differently, depending on the grades they teach. Consider a fourth-grade teacher in a state where standardized testing begins in Grade 3 and ends in Grade 8 (not an uncommon configuration). When his or her value-added score in 2008 is first calculated, using a data window that comprises the 2003–2004 through 2007–2008 school years, the only other year for which they have test scores for this cohort of students is 2006–2007, when students were third graders. As time passes and the window slides forward, it will include increasing amounts of data about the 2007–2008 cohort (as they advance through higher grades). Although there are some concerns about using post-fourth-grade performance to evaluate how much these students learned in fourth
grade, on balance it makes sense to revise the teacher's value-added estimate for the 2007–2008 school year: the individual performing the calculation has more information about those students than he or she had the first time it was calculated.5 This is not the case for an eighth-grade teacher: The situation is asymmetric. When his or her value-added score for 2007–2008 is calculated, the 5-year data window includes information about those 8th-grade students in 4 prior years (if they have been enrolled in the system that long). In subsequent years, as the data window slides forward, EVAAS drops older years: first 2003–2004, when students were in 4th grade, then 2004–2005, when they were in 5th grade, and so forth. At the other end it adds 2008–2009, 2009–2010—but there are no more data on these students in those years. As 9th and 10th graders, they have passed out of the testing regime.6 Thus, the revisions of an 8th-grade teacher's EVAAS scores are based on fewer and fewer data about his or her students, not more. They are increasingly less reliable. To summarize, depending on a teacher's grade level and the amount of information that subsequently becomes available about students he or she had in the past, it can make sense to revise his or her estimated value added. In some cases (e.g., the eighth-grade teacher described above), it clearly makes no sense to revise these estimates, as each revision is based on less information about student performance. Moreover, even when there are sound statistical reasons for carrying out such revisions, revising scores poses political and legal problems for states that want to make high-stakes personnel decisions in a timely manner.7

Roster Verification

Value-added assessment requires data linking students to the teachers who provided instruction in tested subjects. Although one might hope that state administrative data systems would be sufficiently accurate for this purpose, frequently this is not the case.8 Indeed, as noted in a recent review of RTTT implementation conducted for the USDOE, "The major challenge reported by the largest number of SEAs was that current data systems make linking student test data to individual teachers difficult" (Webber et al., 2014). Because many state administrative data systems are not up to this challenge, many states have implemented procedures wherein teachers are called on to verify and correct their class rosters.9 Roster verification raises the obvious concern that teachers might fail to claim students they fear will lower their value-added scores. Indeed, one firm that specializes in developing software to assist with the linkage of students to teachers proclaims on its website, "We've empowered the teachers to have control over their student data, nobody knows who taught a student more than that student's teacher" (RANDA Solutions, 2014). Although this may be true, the notion that teachers might manipulate their rosters in order to improve their value-added scores obtains indirect support from other studies of strategic behavior in response to high-stakes accountability. Such behavior can take the form of focusing excessively on a single test, altering test scores, or assisting students with test questions (Goodnough, 1999; Jacob & Levitt, 2005; Koretz & Barron, 1998). Schools have been found to engage in strategic
classification of students as special education and limited English proficiency (Cullen & Reback, 2006; Deere & Strayer, 2001; Figlio & Getzler, 2002; Jacob, 2005). There is evidence that they employ discipline procedures to ensure that low-performing students will be absent on test day (Figlio, 2003), manipulate grade retention policies (Haney, 2000; Jacob, 2005), misreport administrative data (Peabody & Markley, 2003), and plan nutritionally enriched lunch menus prior to test day (Figlio & Winicki, 2005). These studies suggest that at least some teachers and schools will take advantage of virtually any opportunity to game a test-based evaluation system and that educator behavior will need to be monitored. The roster verification procedures implemented by state departments of education typically require principals or other administrators to corroborate the rosters teachers submit. How well busy administrators working with faulty data systems perform this function is open to question. The process is made more complicated and error prone by the fact that in many states, teachers are instructed to drop from their rosters students whose scores are not to be used in calculating teacher value added.10 To illustrate the potential for abuse of such processes, we have analyzed data for students in Grades 4 through 8 in one of the RTTT states where test scores in mathematics and reading/ English language arts are used to produce value-added assessments of teachers of those subjects. Our data for this analysis include student test scores for the 2011–2012 and 2012–2013 school years, linkages established between students and teachers via roster verification, and administrative records, including course enrollment files reporting the classes students take, the identity of their instructors, and the portion of the school year in which they were enrolled in a particular classroom and school. During roster verification, teachers receive prepopulated rosters based on administrative records and are asked to delete students who should not be used to calculate their value-added assessments. The latter fall into two main groups: students who spent fewer than the requisite number of days in a teacher’s class and students with certain categories of disabilities qualifying them for special education services.11 Using course enrollment files, attendance records, and special education enrollment records, we identify students who should not be claimed on rosters (“exempt”). Others (“nonexempt”) are expected to be claimed during roster verification. There are many discrepancies between our classification of students and the rosters that emerge from the verification process. This in itself is not compelling evidence that teachers are behaving strategically or, indeed, that there are any errors in the rosters that teachers submit: after all, the imperfections of administrative records are the reason that teachers are asked to verify rosters in the first place. However, a second fact, taken in conjunction with the first, strongly suggests teachers are behaving strategically: The students they do not claim have on average test scores far below those of the students who are claimed. Discrepancies can arise in two ways: when teachers claim students that they are instructed not to claim and when teachers fail to claim students they should. 
Clearly it is possible to make honest errors of both types, particularly given that determining whether a student should be claimed requires accurate attendance data for the entire year and, in many cases, information
about the rest of the student’s academic program (e.g., some categories of special education students should be claimed, and others should not). Similarly, it is possible (indeed, likely) that in relying on administrative records to ascertain which students were taught by which teachers, and for how long, we have also made mistakes. However, honest errors arising for haphazard reasons should not introduce systematic differences between the test performance of students claimed and students unclaimed: The two groups should look similar. They do not. In Table 1 we present results from a regression analysis predicting 2012–2013 student math scores as a function of a fourth-order polynomial in a student’s prior year math scores and an indicator for grade level in 2012–2013.12 All scores are expressed in scale score units, with virtually identical means and standard deviations at each grade level (overall mean = 749, overall standard deviation = 95). The sample is restricted to students for whom we are able to identify at least one math course and math instructor in the 2012–2013 course file and for whom prior year scores are available. We report regression coefficients, standard errors, and sample size for three categories of students: unclaimed, nonexempt; unclaimed, exempt; and exempt but claimed. The first and third categories represent discrepancies between the information we have gleaned from administrative records and rosters as verified by teachers. We note that the number of students in the first category is quite small: fewer than 1,000 statewide. There is little evidence here of widespread cheating in the sense that teachers are failing to claim students who should be claimed.13 More curious is the fact that teachers often fail to drop students whom we have placed in the exempt category. Indeed, a majority of the students we deem exempt are claimed by their teachers. In both cases, the regression coefficients tell a similar story. Students who are dropped from teachers’ rosters tend to be those whose test performance is far below what would be expected on the basis of grade level and past scores. Both the unclaimed, nonexempt and the unclaimed, exempt groups score two or more standard deviations below the level expected. In short, regardless of whether administrative records indicate a student is exempt or nonexempt, a student who is not claimed is very likely to be one who would lower teachers’ value added. The discrepant cases do not arise as the result of random errors but are systematically related to student test performance. Could these findings be due to classification errors on our part—misidentifying which students are truly exempt? Suppose that the students unclaimed during roster verification are precisely the set who should go unclaimed—for example, students with excessive absences. High levels of absenteeism would also explain their poor test performance. We are skeptical that this hypothesis accounts for our findings for the following reasons. First, there are 66,004 students in the exempt but claimed category. We identified these as exempt based primarily on attendance and special education records. Given that school funding is tied to the enrollment of special education students and to attendance and that these files are therefore carefully checked, it seems highly unlikely that so many of our exempt students are actually nonexempt. 
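A specification of the kind just described might be set up as in the following sketch. The file name and column names are hypothetical placeholders, and the snippet assumes the pandas and statsmodels packages; it is meant only to make the regression concrete, not to reproduce the authors' code.

```python
# Sketch of the Table 1 specification: 2012-2013 math scale score regressed on a
# fourth-order polynomial in the prior-year score, grade-level indicators, days of
# instruction with the principal math teacher, and indicators (0/1) for the three
# claiming-status categories. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students_2012_13.csv")  # hypothetical analysis file

formula = (
    "math_score ~ prior_math + I(prior_math**2) + I(prior_math**3) + I(prior_math**4)"
    " + C(grade) + days_with_instructor"
    " + unclaimed_nonexempt + exempt_but_claimed + unclaimed_exempt"
)
model = smf.ols(formula, data=df).fit()

# The coefficients and standard errors on the three status indicators correspond
# to the rows of Column 1 of Table 1.
status_vars = ["unclaimed_nonexempt", "exempt_but_claimed", "unclaimed_exempt"]
print(model.params[status_vars])
print(model.bse[status_vars])
```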
Table 1
Achievement Test Scores as a Function of the Teacher's Claiming the Student

Explanatory Variables      (1)               (2)               (3)
Unclaimed, nonexempt       –218.28 (1.12)    –248.48 (1.19)    –245.04 (1.25)
                           n = 643           n = 548           n = 474
Exempt but claimed         –2.94 (0.21)      –2.79 (0.20)      –2.80 (0.23)
                           n = 66,004        n = 62,731        n = 38,554
Unclaimed, exempt          –371.21 (0.36)    –377.39 (0.36)    –382.27 (0.39)
                           n = 13,427        n = 12,333        n = 9,412
Sample size                320,325           256,278           235,127

Note. Sample comprises students in Grades 4 through 8 who took the 2012–2013 spring achievement test in mathematics. Additional regressors include indicators of student grade level, a quadratic function of prior year mathematics score, and the number of days the student received instruction from his or her principal math instructor. Sample in Column 2 excludes students whose teachers claimed no students in the roster verification process. Sample in Column 3 excludes students who had more than one math teacher during the school year.

What about the possibility that unclaimed students were assigned to the wrong teacher in the administrative records and then slipped through the cracks during roster verification? There
would be no presumption in such cases that unclaimed students would systematically underperform claimed students, particularly by such wide margins as are evident in our data. Finally, we have conducted two additional analyses to further reduce the likelihood of error on our part. In Column 2 we have dropped from the estimation sample students taught by teachers who were not linked to any students in the linkage file. There may be something special about these teachers we were unable to detect; moreover, they do not appear to be acting strategically in the sense of selectively failing to claim a subset of their students. This does not alter our findings. Finally, in Column 3 we have limited the sample to students who have only one math teacher in the course file—that is, we drop students who appear to have switched classes and teachers at some point during the year or who had multiple math classes (perhaps a regular math class and an extra help session). In such cases teachers and principals may have been confused about who ought to have claimed the student. Such cases may also arise more often among poor-performing students who are moved in and out of various instructional arrangements in an effort to find something that works. The remaining sample comprises the least ambiguous cases—students who, according to our administrative data, had only one math teacher for the entire year, and about whom there should have been no doubt about who was the responsible instructor. Again, the coefficients on unclaimed, nonexempt and unclaimed, exempt are strongly negative, suggesting that teachers drop students they perceive will hurt their value-added scores. We close this section by stressing what this analysis shows as well as what it does not show. To begin with the latter, our findings do not challenge the overall accuracy of value-added assessments in the state we have studied. A very small number of students fall in the unclaimed, nonexempt category. Even if all of them were claimed, the impact on teacher value added would be negligible. As for the teachers in the exempt but claimed category, their performance as a group is essentially average. Individual teachers may be helped or hurt by their failure to drop such students from their rosters, but there is no resulting bias in value-added scores on the whole.

Moreover, it is clear that processes for roster verification are required, given the imperfections of administrative data. There are benefits to allowing teachers to check their rosters. Teachers must buy into any evaluation system for it to be successful. If the system relies on data known by teachers to be incorrect, it is not likely to be recognized as legitimate. Rather, our analysis points to the potential for abuse. The roster verification process is far from perfect. The rosters that teachers submit contain many errors that slip past the supervisors responsible for checking them: Students are claimed who should not be; others are unclaimed when it appears from administrative data that they ought to be on someone's roster. These discrepancies are not neutral. Teachers tend to avoid claiming students whose test performance would lower their value-added scores. This evidence of self-interested behavior on the part of teachers, combined with inadequate oversight by supervisors, raises the prospect of more serious manipulation of roster verification should value added come to be used for high-stakes personnel decisions, when incentives to game the system will grow stronger.14

Teachers Monitoring Their Own Students During the Exam

Some highly publicized incidents have shown that the use of value-added assessments in high-stakes decisions may lead teachers to cheat. The most egregious forms of cheating involve changing student answer sheets and revealing answers to students. Less attention has been paid to what we suspect is a far more widespread abuse: coaching students during testing. Coaching can take such subtle forms that students, and perhaps even teachers, are not aware that they have overstepped a line. Teachers circulating throughout the room can coach students without saying a word; they have only to read answer sheets and point to questions that students have missed, a practice particularly likely to be effective if the student knows the right answer and has missed the question due to a careless error.

Table 2
Students Monitored by Their Own Classroom Teachers: Effect on Math Scores

Columns: (1) Grade; (2) Median Percentage of Own Students Supervised; (3) Model 1, Effect of Being Supervised by Student's Own Math Teacher; (4) Model 2, Effect of Being Supervised by Student's Own Math Teacher; (5) Model 2, Effect of Being Supervised by Stranger; (6) Model 3, Effect of Being Supervised by Student's Own Math Teacher.

(1)    (2)    (3)           (4)           (5)            (6)
4      90     1.32 (.29)    1.46 (.29)    –1.88 (.48)    2.04 (.34)
5      74     1.46 (.30)    1.49 (.80)    –0.29 (.44)    1.86 (.34)
6      20     0.70 (.23)    0.55 (.23)     1.07 (.15)    0.52 (.23)
7      21     1.15 (.21)    1.13 (.21)     0.16 (.12)    1.21 (.22)
8      21     0.97 (.21)    1.02 (.21)    –0.46 (.14)    0.98 (.22)

Note. Column 2 contains the percentage of his or her own students that the median math teacher supervises during testing. Column 3 contains coefficients from a regression of spring math scores on an indicator for whether the student was supervised during testing by his or her own math teacher. Model 1 also controls for math score in the prior year and for math teacher fixed effects. Model 2 adds an indicator for whether the student was supervised during testing by a teacher he or she did not have in any of his or her classes. Model 3 contains the following additional covariates not in Model 2: student race/ethnicity, student eligibility for free or reduced-price lunch, size of the test-taking group, gender, native English speakers, full-year enrollees in the school, and age. In addition, special education students and students with multiple math instructors have been dropped from the sample in Column 6.

We are unaware of research documenting the extent of this and similar practices, though informal contacts with students and teachers provide anecdotal evidence that they occur. Popular media have also reported on the problem of teachers coaching students during testing in New York, Pennsylvania, California, and Washington, DC (Baker, 2013; Brown, 2013; Dockterman, 2014; Noguchi, 2013).15 Although we are unable to provide a comprehensive answer here, we test the hypothesis that teachers are more likely to engage in this kind of behavior when they monitor their own students during the exam than when they are assigned to monitor students of other teachers. We test this hypothesis using data on students in Grades 4 through 8 in a single urban system in the South. We report results for mathematics, though we have conducted similar analyses for reading/ language arts. Our data include student scores on tests given in the spring of the 2009–2010 school year as well as prior year scores for these students. For this year we also have administrative records that identify the teacher who monitored the exam. The teacher who provided mathematics instruction has been obtained using the student-teacher linkages established by the roster verification process. Students not claimed in this process have been dropped from the analysis. In the second column of Table 2 we report the percentage of a teacher’s own students that the teacher monitors during the administration of the math test. As one might expect, this percentage tends to be high in the elementary grades, where most instruction as well as testing takes place in self-contained classrooms. Among fourth-grade math teachers in this district, the median percentage is 90.16 This falls to 74% in Grade 5. By Grade 6, departmentalized instruction is the norm, and the
percentage of a math teacher's own students who take the test under his or her supervision falls sharply. For the median middle school math teacher, it is 20% to 21%. To test our hypothesis, we regress 2009–2010 math scores on prior year scores, a teacher fixed effect, and an indicator for whether the student is monitored by his or her math teacher during the exam. We report coefficients on the last variable in the third column of Table 2. The dependent variable is the raw number right on the exam. Thus, the coefficient represents how many more questions students get right, on average, when their own math teacher monitors the exam. The inclusion of teacher fixed effects means this is a within-teacher comparison: Students monitored by their own teacher are being compared to other students of the same teacher monitored by someone else. Because the great majority of fourth and fifth graders take exams monitored by their own teachers, it is possible that those who do not are exceptional cases (e.g., students who require a special testing environment) who might for other reasons be expected to perform poorly on the exam. This could produce an upward bias in our estimates for Grades 4 and 5. However, bias from the same source is not a plausible factor in Grades 6 through 8, where departmentalized instruction makes it infeasible for most math teachers to administer exams to more than a small fraction of their students.17 The results are striking. At every grade level, the number of questions answered correctly is higher when students are monitored by their own teacher. Except in sixth grade, the improvement is at least one question per student. Because this is the average effect, to find the total number of additional questions answered correctly when students are monitored by their own
teacher, we must multiply by the number of students present for testing. Given that teachers typically monitor upwards of 20 students in a testing session, the aggregate impact on class performance appears to exceed what would occur if teachers only occasionally offered a limited amount of assistance to particularly importunate students. We have conducted similar analyses for teachers of reading/language arts and science. Although the results are not as strong as for mathematics, in about half the grades we find a statistically significant advantage in favor of students monitored by their instructor. (In reading/language arts, the estimated coefficients for Grades 4 through 8 are −.025, −.085, .711*, .505*, and −.266; in science they are .363, .947*, .645*, .078, and .854*; estimates with asterisks are significant at the 5% level.) We do not believe these results indicate that math teachers are more eager to assist their students on tests. Rather, we suspect that interventions that can be used in math (e.g., pointing out a careless error and leaving it to the student to identify and fix the mistake) are harder to apply in other subjects. An alternative interpretation of these findings is that students naturally do better when their own teacher supervises the exam as opposed to a teacher they do not know. We have heard teachers and administrators defend the practice of having teachers monitor their own students during testing on just these grounds: that performance will suffer because an unfamiliar teacher will make students nervous. If this is a valid concern, it would appear to apply principally to younger students in elementary school, not to middle school students who are accustomed to having multiple instructors and who are apt to be given the test by a homeroom teacher or someone else they know from other courses and activities. We have tested this hypothesis by adding an indicator for whether the teacher monitoring the test is one the student knows or is a stranger. ("Strangers" are defined as teachers who do not appear in any of the student's course records, including homeroom.) Coefficients on this indicator (reported in Column 5) do not fall into a clear pattern, exhibiting a mix of positive and negative signs, although the strongly negative coefficient in Grade 4 is consistent with the hypothesis that a strange teacher has a negative impact on the performance of the youngest students. However, inclusion of this variable makes almost no difference to the estimated effect of being supervised by one's own math instructor (Column 4). Finally, to control for the possibility that the students monitored by their own math teacher differ systematically from those monitored by someone else, we introduce a variety of controls for student characteristics, such as race, income, and whether the student is a native English speaker. The introduction of these controls makes little difference to our findings. It may be that teachers generally provide assistance to students during testing, whether they have these students in class or not. We have examined only the differential effect of taking a test under the supervision of one's own math teacher. The effect is positive and both statistically and substantively significant. This is not definitive proof that the effect is the result of coaching. Other explanations are possible, though those we have been able to test have not accounted for our findings.
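For readers who want to see the comparison spelled out, the following sketch sets up the within-teacher regression described above (Model 1 of Table 2) grade by grade. File and column names are hypothetical placeholders, and pandas and statsmodels are assumed; the sketch illustrates the specification rather than reproducing the authors' analysis.

```python
# Sketch of the Table 2, Model 1 comparison: raw number of questions answered
# correctly regressed on the prior-year score, a 0/1 indicator for being monitored
# by one's own math teacher, and math-teacher fixed effects, run separately by grade.
# File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("monitoring_2009_10.csv")  # hypothetical analysis file

for grade, grade_df in df.groupby("grade"):
    fit = smf.ols(
        "raw_score ~ prior_math + monitored_by_own_teacher + C(math_teacher_id)",
        data=grade_df,
    ).fit()
    # The coefficient is the within-teacher difference in questions answered correctly
    # when a student's own math teacher supervises the exam.
    print(grade,
          round(fit.params["monitored_by_own_teacher"], 2),
          round(fit.bse["monitored_by_own_teacher"], 2))
```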
Like our analysis of roster verification, the results suggest that teachers are taking advantage of opportunities presented by the system to
improve their own measured performance, in the one case by dropping from their rosters students who would lower their value-added scores and in the other by providing assistance to students taking tests under their supervision.

Conclusion

The USDOE's RTTT reform initiative required states and school districts to develop educator evaluation systems that, in part, take into account student performance on standardized tests and inform personnel decisions about professional development, compensation, promotion, tenure, and certification. While we regard these evaluation systems as a positive development, a number of problems in their design and implementation need to be addressed. Some of the problems we have described can be more easily corrected than others. Evaluation systems need to find better ways of handling measurement error when conducting summary assessments of educators. New York State has shown the way with a two-step procedure wherein the system first identifies teachers whose estimated value added falls above (or below) some threshold and then, in a second step, decides whether further action is warranted given the precision of the estimate. Likewise, if (as it appears) teachers monitoring their own students are more likely to coach them during the test, there would seem to be an obvious remedy: Someone else should monitor them (or at least be in the room).18 It is likely that roster verification procedures can be improved. However, the complexity of these procedures militates against complete accuracy, particularly when teachers are asked not simply to verify which students they taught but to drop from their rosters students who are not to be used in computing value-added scores. The more complicated the criteria determining which students do and do not count, the harder it is for principals and other supervisors to check compliance. The problem may grow worse as the demands on these evaluation systems increase. In New York State, for example, plans are in place to use fractional linkages in the calculation of individual teacher value added; that is, districts will provide the state with fine-grained instructional linkage time so that value-added estimates can be weighted by the amount of instructional time provided by each teacher. As fractional linkages become a part of teacher-level value-added estimates, the rate of linkage failures is likely to increase.19 From a practical standpoint, the most difficult problems to solve may be those posed by the practice of revising estimates of teacher effectiveness as more recent data become available. It is too late to pretend such estimates don't exist: The genie is out of the bottle. Likewise, a state that made a practice of issuing revised "improved" estimates would appear to be in a poor position to argue that high-stakes decisions ought to be based on initial, unrevised estimates, though in fact the grounds for regarding the revised estimates as an improvement are sometimes highly dubious. There is no obvious fix for this problem, which we expect will be fought out in the courts. To conclude, states are installing educator evaluation systems that, in part, rely on student test scores to measure teacher performance. Given that such systems are only as good as the data available to them and the policies and procedures that govern them, it is clear that states need to better address the underappreciated problems outlined here if these systems are to become a permanent part of the K–12 landscape.

Notes

We appreciate helpful comments and suggestions from Henry Braun, Doug Harris, Carolyn Herrington, and two anonymous referees. We would also like to acknowledge the many individuals who offered expert insight on teacher evaluation systems in Race to the Top states and Jason Spector for excellent research assistance. Any errors remain the sole responsibility of the authors. The views expressed in this article do not necessarily reflect those of sponsoring agencies or individuals acknowledged. 1 One might wonder if the teacher with the smaller class makes up for having fewer students by teaching more subjects to them; thus, her performance measure might average math, reading/English language arts, and even science scores across her 25 students, while the second teacher has scores only in a single subject. Because the scores of a given student are highly correlated across subjects, this is a false hope: Including more scores for the same set of students does not reduce noise in the same way as averaging across a larger number of students. 2 Indeed, the top and bottom categories of the North Carolina evaluation system correspond to teachers deemed more effective than average and less effective than average with 95% confidence. 3 To our knowledge, only New York has adopted a staggered, twostep test of this kind. 4 The reason for the revisions is that Educational Value-Added Assessment System (EVAAS) is based on a “layered model,” in which a student’s score at any point in time represents the cumulative impact of past teachers. Thus, a fifth grader’s test score in year t is represented by the following equation: SCORE(t) = Grade 5 Value-Added(t) + Grade 4 Value-Added(t – 1) + Grade 3 Value-added(t – 2) + District Effect(t) + Idiosyncratic Error(t). EVAAS yields estimates of the first four components of this expression when the model is estimated in year t. Thus the fourth grade teacher’s value added for her cohort of students in t – 1, which was estimated once already in year t – 1, is reestimated in year t when that cohort reaches fifth grade, and yet again in year t + 1 when they reach sixth grade. Indeed, further revisions could be obtained from years t + 2 and t + 3 as these students reach seventh and eighth grades, but EVAAS reports results as 3-year moving averages, using only a teacher’s most recent three cohorts. 5 The assumption underlying EVAAS is that a teacher affects growth only during the year the teacher has a student; future scores are used only to get a better fix on how much that growth was. If teachers have an effect on how much students learn in the future (as we believe is at least sometimes the case, e.g., with transformational teachers), a different, more complicated, model is required. 6 Additional test scores may be available as students complete high school end-of-course exams. But these do not appear in a regular manner each year (e.g., an Algebra I test taken when the student completes that course), and to our knowledge EVAAS has never attempted to combine such information with the results of standardized testing conducted on a regular, annual basis in core subjects in lower grades when estimating value added of elementary and middle school teachers. 7 It might be argued that the solution is to determine how long to wait for additional data, taking into account the costs of delaying decisions, and then simply follow this rule, revising value-added estimates in accordance with it. We believe this would be utterly infeasible. 
At a minimum, the optimal waiting period will vary by grade taught. It should also vary based on the initial value-added estimate: for example, the worse the initial score, the costlier it will be to wait. We doubt a system of such complexity, treating teachers differently (even if “optimally”), would be regarded as politically viable or as likely to survive a legal challenge. 8 On this issue see Han, McCaffrey, Springer, and Gottfried (2012); Isenberg, Teh, and Walsh (2013); and Kluender, Thorn, and Watson (2011).

9 We have visited state Department of Education websites of all Race to the Top (RTTT) states: Arizona, Colorado, Delaware, Florida, Georgia, Hawaii, Illinois, Kentucky, Louisiana, Maryland, Massachusetts, New Jersey, New York, North Carolina, Ohio, Pennsylvania, Rhode Island, and Tennessee, as well as the District of Columbia. We found explicit reference to roster verification procedures involving teachers for Colorado, Florida, Georgia, Hawaii, Kentucky, Louisiana, New York, North Carolina, Ohio, Pennsylvania, Rhode Island, Tennessee, and the District of Columbia. See, for example, Tyler (2011), PVAAS Statewide Core Team for PDE (2013), Center for Educational Leadership and Technology (n.d.), Rhode Island Department of Education (2013), and Battelle for Kids (2012). Some states leave these procedures up to local education agencies (Illinois, Arizona). 10 We are aware of two other studies of roster verification. Isenberg et al. (2013) investigate discrepancies between unconfirmed and confirmed rosters in upper elementary grades in the District of Columbia public schools. They attribute the discrepancies primarily to the use of departmentalized instruction in these grades not reflected in enrollment records. Their focus is therefore on which teacher claims a student rather than on whether any teacher claims a student, and they do not investigate the performance of unclaimed students. Kluender et al. (2011) is a qualitative study describing some of the challenges to establishing reliable student-teacher linkages. 11 The exact rule is that students count only if they are expected to be in the class at least 150 days of the school year (75 days for semester-length classes). Days absent are subtracted in arriving at these totals. Since testing occurs in April, this obviously requires some forecasting on the part of teachers. We operationalized this rule by requiring students to have been in class at least 120 days by the test date (60 days for semester-length classes). Our results are not sensitive to small changes (e.g., adding or subtracting 10 days from these figures), as small numbers of students are near enough these cutpoints to be affected. 12 We have conducted a parallel analysis of scores in reading/English language arts. Results are very similar. 13 The 643 students in the first row of Column 1 are assigned to 407 different teachers. Although the former figure is a fraction of 1% of the student sample used in the analysis, the 407 teachers represent 3% of the mathematics instructors in the sample. The 66,004 students in the second row of Column 1 are taught by 8,120 different instructors—more than 62% of the mathematics instructors in the sample. It is not the case that these discrepancies arise in only a small subset of the sample. 14 In the state we studied, plans were announced to use value-added assessments as part of a personnel evaluation system that would determine, among other things, whether veteran teachers' licenses would be renewed. The state subsequently backed off this plan, and no such stakes were in place during the 2012–2013 school year. The state plans to use the evaluation system to determine compensation, but this change lies in the future. 15 This problem is not limited to the United States. Similar reports have surfaced about Australia's Naplan tests (McDougall, 2014). 16 Even in fourth grade, however, there is considerable variation. For example, the teacher at the 25th percentile monitors only 34% of his or her students.
More than 10% of fourth-grade math teachers do not monitor any of their students during the exam. 17 Special education students were exempt from the claiming process and have already been dropped from the analysis unless mistakenly claimed by a teacher. As an additional check, we have excluded these remaining special education students from the sample. The number of such students is small, and the results are barely affected by their exclusion. 18 In our review of RTTT state testing protocols, we did not find a single state that prohibited classroom teachers from administering state assessments to their own students provided none of the students was a relative.

19 According to a technical report prepared for New York State's educator evaluation growth model, 17% of students with valid test data were not linked to teachers for the required duration of time (American Institutes for Research, 2012). This means that more than 328,000 students in Grades 4 to 8 have fractional linkages that will need to be tracked and incorporated into the state's educator evaluation growth model in the future.

References

American Institutes for Research. (2012). 2011-2012 growth model for educator evaluation technical report: Final. Washington, DC: Author. Retrieved from https://www.engageny.org/file/8806/download/growth-model-11-12-air-technical-report.pdf?token=JnS511OVjAYVF4Lgqwdy3IsZlA6LOIxrMj0LZ-r66tk

American Statistical Association. (2014). ASA statement on using value-added models for educational assessment. Retrieved from http://www.amstat.org/policy/pdfs/ASA_VAM_Statement.pdf

Baker, A. (2013). Allegations of test help by teachers. The New York Times. Retrieved from http://www.nytimes.com/2013/04/12/education/long-island-educators-under-inquiry-for-test-help.html?_r=2&

Ballou, D., Sanders, W., & Wright, P. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1), 37–65.

Battelle for Kids. (2012). Roster verification using BFKLink (presented to Colorado Department of Education). Retrieved from http://www.tsdl.org/resources/site1/general/White%20Papers/TSDL_RosterVerificationConceptPaper.pdf

Betebenner, D. W. (2008). A primer on student growth percentiles. Dover, NH: National Center for the Improvement of Educational Assessment.

Betebenner, D. W. (2011). A technical overview of the student growth percentile methodology: Student growth percentiles and percentile growth projections/trajectories. Dover, NH: National Center for the Improvement of Educational Assessment.

Brown, E. (2013). D.C. report: Teachers in 18 classrooms cheated on students' high-stakes tests in 2012. The Washington Post. Retrieved from http://www.washingtonpost.com/local/education/memo-could-revive-allegations-of-cheating-in-dc-public-schools/2013/04/12/9ddb2bb6-a35e-11e2-9c03-6952ff305f35_story.html

Center for Educational Leadership and Technology. (n.d.). Kentucky final report. Retrieved from http://www.tsdl.org/resources/site1/general/White%20Papers/TSDL_KentuckyFinalReport.pdf

Cullen, J. B., & Reback, R. (2006). Tinkering toward accolades: School gaming under a performance accountability system (NBER Working Paper #12286). Cambridge, MA: National Bureau of Economic Research.

Deere, D., & Strayer, W. (2001). Putting schools to the test: School accountability, incentives, and behavior. Unpublished manuscript, Texas A&M University.

Dockterman, E. (2014). Philadelphia educators indicted for helping students cheat. Time. Retrieved from http://time.com/93196/philadelophia-educators-charged-with-helping-students-cheat/

Figlio, D. (2003). Testing, crime and punishment. Unpublished manuscript, University of Florida.

Figlio, D., & Getzler, L. (2002). Accountability, ability and disability: Gaming the system? (National Bureau of Economic Research Working Paper 9307). Cambridge, MA: National Bureau of Economic Research.

Figlio, D., & Winicki, J. (2005). Food for thought? The effects of school accountability plans on school nutrition. Journal of Public Economics, 89, 381–394.

Georgia Department of Education. (2013). Teacher keys effectiveness system. Office of School Improvement, Teacher and Leader Effectiveness Division. Retrieved from http://www.gadoe.org/School-Improvement/Teacher-and-Leader-Effectiveness/Documents/TKES%20Handbook%20FINAL%207-18-2013.pdf
Goodnough, A. (1999, December 8). Answers allegedly supplied in effort to raise test scores. New York Times.

Han, B., McCaffrey, D. F., Springer, M., & Gottfried, M. (2012). Teacher effect estimates and decision rules for establishing student-teacher linkages: What are the implications for high-stakes personnel policies in an urban school district? Statistics, Politics, and Policy, 3(2), 1–22.

Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8(41). Retrieved from http://epaa.asu.edu/epaa/v8n41/

Harris, D. N. (2011). Value-added measures in education. Cambridge, MA: Harvard Education Press.

Isenberg, E., Teh, B., & Walsh, E. (2013). Elementary school data issues: Implications for research using value-added models (Mathematica Policy Research Working Paper). Retrieved from http://www.mathematica-mpr.com/~/media/publications/pdfs/education/elementary_school_data_issues.pdf

Jacob, B. (2005). Testing, accountability, and incentives: The impact of high-stakes testing in Chicago public schools. Journal of Public Economics, 89(5/6), 761–796.

Jacob, B., & Levitt, S. (2005). Rotten apples: An investigation of the prevalence and predictors of teacher cheating. Quarterly Journal of Economics, 118(3), 843–877.

Kluender, R., Thorn, C., & Watson, J. (2011). Why are student-teacher linkages important? An introduction to data quality concerns and solutions in the context of classroom-level performance measures. Madison, WI: Center for Educator Compensation Reform.

Koretz, D., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional Results Information System (KIRIS) (M-1014-EDU). Santa Monica, CA: RAND Corporation.

McCaffrey, D. F., Lockwood, J. R., Koretz, D. M., & Hamilton, L. S. (2003). Evaluating value-added models for teacher accountability. Santa Monica, CA: RAND Corporation.

McDougall, B. (2014). Principals and teachers banned from coaching Naplan tests. The Advertiser. Retrieved from http://www.adelaidenow.com.au/news/principals-and-teachers-banned-from-coaching-naplan-tests/story-fn3o6nna-1226865821078

Noguchi, S. (2013). San Jose teacher helped second-graders cheat on STAR test. San Jose Mercury News. Retrieved from http://www.mercurynews.com/central-coast/ci_24183649/san-jose-teacher-helped-second-graders-cheat-star

Peabody, Z., & Markley, M. (2003, June 14). State may lower HISD rating; Almost 3,000 dropouts miscounted, report says. Houston Chronicle, p. A1.

PVAAS Statewide Core Team for PDE. (2013). SY2013-14 statewide implementation: Pennsylvania value-added assessment system PVAAS teacher specific reporting guide to implementation 2013. Retrieved from http://www.cliu.org/cms/lib05/PA06001162/Centricity/Domain/21/SY1314%20Implementation%20Guide%20%20October%202013.pdf

RANDA Solutions. (2014). New USDOE report reveals many states struggle to link individual student test data to the proper teacher. Retrieved from www.prweb.com/releases/2014/02/prweb11562558.htm

Rhode Island Department of Education. (2013). Roster verification 2012–13: A user's guide for teachers. Retrieved from http://www.eride.ri.gov/RosterVerification/UserGuide_Teacher.pdf

SAS Institute. (2014). SAS EVAAS for K–12. Retrieved from http://www.sas.com/content/dam/SAS/en_us/doc/productbrief/sas-evaas-k12-104570.pdf


Sawchuk, S. (2014). AFT's Weingarten backtracks on using value-added measures for evaluation. Education Week. Retrieved from http://blogs.edweek.org/edweek/teacherbeat/2014/01/weingartens_retrenchment_on_va.html

Tyler, O. S. (2011). Curriculum verification and results (CVR) reporting portal implementation guide. Louisiana Department of Education. Retrieved from http://www.sas.com/en_us/industry/k-12-education/evaas.html

United States Department of Education. (2009). Race to the top program. Washington, DC: Author. Retrieved from https://www2.ed.gov/programs/racetothetop/executive-summary.pdf

United States Department of Education. (2014). Race to the top. Washington, DC: Author. Retrieved from http://www.whitehouse.gov/issues/education/k-12/race-to-the-top

Webber, A., Troppe, P., Milanowski, A., Gutmann, B., Reisner, E., & Goertz, M. (2014). State implementation of reforms promoted under the Recovery Act. Washington, DC: U.S. Department of Education, Institute of Education Sciences.

Authors

DALE BALLOU, PhD, is an associate professor of public policy and education at Vanderbilt University, 141 Wyatt Center, Peabody No. 14, Nashville, TN 37203-5701; [email protected]. His research focuses on incentives, accountability, and teacher performance.

MATTHEW G. SPRINGER, PhD, is an assistant professor of public policy and education at Peabody College of Vanderbilt University and the director of the National Center on Performance Incentives, Peabody No. 43, 230 Appleton Place, Nashville, TN 37203-5701; [email protected]; Twitter: @eduspringer. His research focuses on incentives, accountability, and compensation.

Manuscript received July 7, 2014
Revisions received December 10, 2014, and February 3, 2015
Accepted February 5, 2015

Appendix: Illustrative Calculations of the Student Growth Component of Georgia's Teacher Effectiveness Measure

Teacher A has a class of 25, teacher B a class of 100. The true mean student growth percentile (TSGP) for both teachers is 40, in the sense that this is the mean percentile we would observe if the two teachers were assigned students of average ability whose performance was measured without error. The measured mean growth percentiles for these teachers are MSGPA and MSGPB, respectively. The relationship of MSGPi to TSGPi is

MSGPi = TSGPi + (1/Ni) Σj uij,

where the uij are student-level errors reflecting good or bad luck in teaching assignments and test measurement error and Ni is the number of tested students. We assume that in the population these teachers serve, the uij are normally distributed with mean 0 and variance σ². Then the probability that a teacher is rated ineffective is

Prob(MSGPi < 30) = Prob[(1/σ)(√Ni)(MSGPi − 40) < (1/σ)(√Ni)(30 − 40)].
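Because (1/σ)(√Ni)(MSGPi − 40) is a standard normal random variable under these assumptions, the probability can be evaluated with the standard normal distribution function. The short sketch below does so for classes of 25 and 100 students; the value of σ is purely illustrative, so the resulting probabilities should not be read as estimates of the rates in Georgia or as a reproduction of the authors' calculations.

```python
# Sketch of the appendix calculation: probability that a teacher with true mean
# growth percentile TSGP = 40 is rated ineffective (measured mean below 30),
# for classes of 25 and 100 tested students. The student-level standard deviation
# sigma is an illustrative value, not one taken from the article.
from math import sqrt
from statistics import NormalDist

def prob_rated_ineffective(n_students: int, tsgp: float = 40.0,
                           sigma: float = 25.0, cutoff: float = 30.0) -> float:
    # MSGP = TSGP + mean of n_students i.i.d. N(0, sigma^2) errors,
    # so MSGP ~ N(TSGP, sigma^2 / n_students).
    z = sqrt(n_students) * (cutoff - tsgp) / sigma
    return NormalDist().cdf(z)

for n in (25, 100):
    print(n, round(prob_rated_ineffective(n), 4))
# The smaller class is flagged far more often despite identical true effectiveness.
```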