Abstract
The purpose of this study was to investigate the processes and procedures that affect the validation of observation instruments used in teacher evaluation and the establishment of their inter-rater reliability. A newly created observation instrument used in an extensive teacher evaluation program in a southeastern school district served as the object of investigation. Inter-rater reliability coefficients for the instrument were assessed as part of the validation of the evaluation system. Additionally, two chance-corrected measures of inter-rater reliability were compared to determine whether Gwet's AC1 statistic, which is often used in medical research but rarely in education, outperformed the more commonly reported kappa statistic. The inter-rater reliability coefficients for all videos and all items combined were in an acceptable range, as were those for most individual standards. Gwet's AC1 statistic regularly outperformed the kappa statistic as a chance-corrected measure of inter-rater reliability. This finding held for all teachers combined, for the highest-rated teacher, and for the lowest-rated teacher, suggesting that Gwet's AC1 statistic shows promise for future inter-rater reliability studies in a teacher evaluation context. Although Gwet's AC1 statistic outperformed kappa for the lowest-rated teacher, the inter-rater reliability coefficients for that teacher suggest that consistently and accurately identifying poorly performing teachers remains elusive. This finding also raises the possibility that the standards by which teachers are traditionally assessed are not developed enough to support accurate identification of poorly performing teachers. Further research in this area is warranted.
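To make the comparison between the two chance-corrected statistics concrete, the following is a minimal sketch for the simplest case of two raters and nominal categories. It is an illustration only, not the study's procedure: the function name `agreement_stats`, the two-rater restriction, and the toy ratings are assumptions introduced here, and the study's data are not reproduced. Both statistics take the form (p_o - p_e) / (1 - p_e), where p_o is the observed agreement; they differ only in how the chance-agreement term p_e is estimated.

```python
from collections import Counter


def agreement_stats(rater_a, rater_b, categories=None):
    """Return (cohen_kappa, gwet_ac1) for two raters on nominal categories.

    Hypothetical helper for illustration; assumes exactly two raters who
    each rated every item once.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    # Assumption: if the full rating scale is not supplied, use observed labels.
    if categories is None:
        categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")

    # Observed agreement: share of items on which the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    count_a, count_b = Counter(rater_a), Counter(rater_b)
    prop_a = {c: count_a[c] / n for c in categories}
    prop_b = {c: count_b[c] / n for c in categories}

    # Cohen's kappa: chance agreement from the product of the raters' marginals.
    pe_kappa = sum(prop_a[c] * prop_b[c] for c in categories)

    # Gwet's AC1: chance agreement from the average marginal per category.
    pe_ac1 = sum(
        ((prop_a[c] + prop_b[c]) / 2) * (1 - (prop_a[c] + prop_b[c]) / 2)
        for c in categories
    ) / (q - 1)

    kappa = (p_o - pe_kappa) / (1 - pe_kappa)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1


# Toy data (invented): 20 lessons, ratings heavily skewed toward "high".
rater_a = ["high"] * 17 + ["low"] + ["high", "low"]
rater_b = ["high"] * 17 + ["low"] + ["low", "high"]

kappa, ac1 = agreement_stats(rater_a, rater_b)
print(f"kappa = {kappa:.3f}, AC1 = {ac1:.3f}")
```

In this toy example the raters agree on 18 of 20 lessons, yet kappa is about 0.44 while AC1 is about 0.88. When ratings cluster in one category, kappa's chance-agreement term becomes large and depresses the coefficient, whereas AC1's term stays small. This is one commonly cited reason the two statistics can diverge, and it is offered here as background rather than as an explanation confirmed by the study.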