Recently, Old Dominion University embarked on an initiative to improve the teaching of disciplinary writing across courses university-wide. This is part of ODU’s Quality Enhancement Plan, an effort to improve undergraduate instruction in general. It’s an extensive program, involving extra instructional training and internal grant competitions, among other initiatives.
Writing quality is one of the best indicators of deep understanding of subject matter, and feedback on that writing is among the most valuable content an instructor can provide. Unfortunately, large class sizes have shifted the responsibility for grading writing from the faculty teaching courses to the graduate teaching assistants supporting them. Or more plainly: when you have 150 students in a single class, there is simply no way to provide detailed writing feedback yourself several times in a semester, on top of the other duties required of instructors, without working substantial overtime.
With that in the background, I was pleasantly surprised to discover a new paper by Doe, Gingerich and Richards in the most recent issue of Teaching of Psychology on training graduate teaching assistants to evaluate writing. In their study, they compared 12 GTA graders, who completed a 5-week course on “theory and best practices” for grading, with 8 professional graders at two time points, about 3 months apart. For this study, the professional graders were treated as a “gold standard” for grading quality.
Overall, the researchers found that GTAs were more lenient than professional graders at both time points. Both groups gave higher grades at Time 2 than at Time 1. While this is most likely due to student learning over the course of the semester, it might also reflect variation in assignment difficulty; the researchers did not describe any attempt to account for this. Because the professional graders assessed papers blind to time point, a purely temporal grading drift can be ruled out.
The researchers also found a significant interaction between GTA grader identity and time: individual graders changed at different rates, so GTAs were not homogeneous, suggesting important individual-difference moderators (perhaps some GTAs become harsher and others more lenient over time?). The researchers also assessed comment quality, finding increases over time among GTA raters, but with differences across dimensions of comment quality (e.g. discussion of strengths and weaknesses, rhetorical concerns, positive feedback, etc.). The relative magnitudes of these increases were not examined, although they could be computed from the provided table.
One major issue with this study is that the reliabilities of the comment quality outcomes are quite low. Correlations between raters ranged from .69 to .92, but both raters assessed only 10% of the papers. Using these correlations as estimates of inter-rater reliability assumes that every rater scores every paper and that the mean of their ratings is used. Since most papers were assessed by only one rater, these reliabilities are overestimates and should have been corrected downward using the Spearman-Brown prophecy formula. Having said that, the effect of low reliability is that observed relationships are attenuated – that is, they appear smaller in the dataset than they really are. So had this been done correctly, the more accurate reliability estimates would have made the researchers’ results look stronger, not weaker: the effects of time (and of grader) may be much larger than indicated here.
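To make the two corrections above concrete, here is a minimal sketch in Python. The .69 lower bound comes from the paper; everything else (the assumption that a reported correlation reflects two-rater averaged scores, and the illustrative observed effect of .30) is hypothetical, chosen only to show the direction and rough size of the adjustments.

```python
def spearman_brown(r, k):
    """Reliability of the mean of k raters, given single-rater reliability r."""
    return k * r / (1 + (k - 1) * r)

def step_down(r_mean, k):
    """Invert Spearman-Brown: single-rater reliability from a k-rater mean."""
    return r_mean / (k - (k - 1) * r_mean)

def disattenuate(r_obs, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_obs / (rel_x * rel_y) ** 0.5

# If a reported correlation of .69 reflects agreement between two raters'
# averaged scores, the reliability of a single rater's score is lower:
r_single = step_down(0.69, k=2)  # ≈ .53

# An observed effect computed from single-rater scores is then attenuated;
# e.g. a hypothetical observed r of .30 corrects upward:
r_corrected = disattenuate(0.30, r_single, 1.0)  # ≈ .41
print(round(r_single, 2), round(r_corrected, 2))
```

The point of the sketch is simply that the single-rater reliability is lower than the reported figure, and that correcting for it inflates, rather than shrinks, the estimated effects – consistent with the argument above.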
Overall, the researchers conclude that the assumption of GTA quality when grading writing assignments is misplaced. Even with training and practice, GTA performance did not reach that of professional graders. On the bright side, training did help. The researchers conclude that continuous training is necessary to ensure high-quality grading – a one-time training, or I suspect more commonly, no training at all, is insufficient.
One thing the researchers did not assess was how GTA and professional comment quality compare with professor comment quality. But perhaps that would hit a bit too close to home!