The discussion about reliability is very interesting.
I have a related problem and hope stat-lers can offer some help.
The following data is from a writing test:
Obv RaterID1 RaterID2 Score1 Score2 Score
1 1 2 3 4 3.5
2 1 2 5 6 5.5
.......
20 1 3 4 5 4.5
21 1 3 3 2 2.5
.......
44 2 1 4 4 4.0
45 2 1 3 3 3.0
.......
There were more than 40,000 papers and about 40 raters. The final score is
the average of the two scores unless they differ by more than 1. In that
case a more experienced rater determines the score. For the sake of
simplicity the third rater is ignored here.
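
The scoring rule above can be sketched as follows. This is only an
illustration: the function name final_score and the optional argument third
(the experienced rater's score) are my own names, not part of the actual
scoring system.

```python
def final_score(score1, score2, third=None):
    """Illustrative scoring rule (names are hypothetical).

    Returns the average of the two ratings when they differ by at most 1.
    Otherwise returns `third`, the score assigned by a more experienced
    rater, or None if no such score is available.
    """
    if abs(score1 - score2) <= 1:
        return (score1 + score2) / 2
    return third


if __name__ == "__main__":
    print(final_score(3, 4))           # ratings agree within 1 point
    print(final_score(2, 5, third=4))  # discrepancy resolved by third rater
```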
The reliability is calculated by GENOVA, which outputs something like this:
RaterID1 RaterID2 Reliability
1 2 0.67
1 3 0.66
.............
2 1 0.66
2 3 0.77
.............
The above Inter-Rater reliability is calculated by VAR(PARTICIPANT)/VAR(TOTAL).
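
For one pair of raters, a ratio of this kind might be computed as in the
sketch below. It assumes a fully crossed papers-by-raters layout with two
raters and estimates the variance components from the usual two-way ANOVA
mean squares. This is an assumption on my part about the design; it is not
GENOVA's code, and GENOVA's estimates may differ. The function name
variance_ratio is hypothetical.

```python
def variance_ratio(scores1, scores2):
    """Estimate VAR(paper) / VAR(total) for two raters scoring the same papers.

    Assumes a crossed papers x raters design (an assumption, not GENOVA's
    documented procedure). Negative variance estimates are truncated to 0.
    """
    n = len(scores1)          # number of papers
    k = 2                     # number of raters
    if n < 2 or len(scores2) != n:
        raise ValueError("need two equal-length lists with at least 2 papers")

    grand = (sum(scores1) + sum(scores2)) / (n * k)
    paper_means = [(a + b) / k for a, b in zip(scores1, scores2)]
    rater_means = [sum(scores1) / n, sum(scores2) / n]

    # Sums of squares for papers, raters, and the residual.
    ss_paper = k * sum((m - grand) ** 2 for m in paper_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for x in list(scores1) + list(scores2))
    ss_resid = ss_total - ss_paper - ss_rater

    ms_paper = ss_paper / (n - 1)
    ms_rater = ss_rater / (k - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))

    # Variance components from expected mean squares.
    var_paper = max((ms_paper - ms_resid) / k, 0.0)
    var_rater = max((ms_rater - ms_resid) / n, 0.0)
    var_resid = max(ms_resid, 0.0)

    total = var_paper + var_rater + var_resid
    return var_paper / total if total > 0 else float("nan")
```

Note that under this reading a rater who is consistently one point more
severe than the partner lowers the ratio even if the two rank the papers
identically, because the rater variance is counted in the total.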
This is similar to the problem discussed.
Now I am interested in the quality of the individual raters. That is, I would
like to give each rater a score for their performance. The reliability
calculated above cannot be used directly because it describes a PAIR of
raters, not a single rater. My question is:
IS IT POSSIBLE TO DO SOME ANALYSIS ON RATER PERFORMANCE WITH THE DATA LISTED
ABOVE? Are there any references?
BTW, right now, in order to assess rater quality, a sample of papers is
selected and rated by each rater. In addition, several experts also do the
scoring to create a Master score. That is:
Rater MasterScore RaterScore
1 4 4
1 5 4
........
2 4 5
2 5 4
........
The Master-Rater correlation or VAR(paper)/VAR(TOTAL) for each rater is then
used as an indicator of rater performance. However, this procedure is not very
good because of:
(1) The extra work involved.
(2) The experts are not perfect.
(3) Small sample size.
(4) The assessment is done during training, which may not reflect a rater's
performance in the actual scoring.
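
For concreteness, the Master-Rater correlation described above could be
computed per rater roughly as follows. This is a minimal sketch: the function
name and the (rater, master_score, rater_score) record layout are my own
assumptions, and the numbers in the usage example are made up for
illustration, not real data.

```python
from collections import defaultdict
from math import sqrt


def master_rater_correlation(records):
    """Pearson correlation between master and rater scores, per rater.

    `records` is an iterable of (rater_id, master_score, rater_score)
    tuples (a hypothetical layout). A rater with fewer than two papers,
    or with no variation in either score, gets None.
    """
    by_rater = defaultdict(list)
    for rater_id, master, score in records:
        by_rater[rater_id].append((master, score))

    result = {}
    for rater_id, pairs in by_rater.items():
        n = len(pairs)
        if n < 2:
            result[rater_id] = None
            continue
        mean_m = sum(m for m, _ in pairs) / n
        mean_s = sum(s for _, s in pairs) / n
        sxy = sum((m - mean_m) * (s - mean_s) for m, s in pairs)
        sxx = sum((m - mean_m) ** 2 for m, _ in pairs)
        syy = sum((s - mean_s) ** 2 for _, s in pairs)
        if sxx == 0 or syy == 0:
            result[rater_id] = None  # correlation undefined
        else:
            result[rater_id] = sxy / sqrt(sxx * syy)
    return result


if __name__ == "__main__":
    # Made-up illustrative records: (rater, master score, rater score).
    sample = [(1, 4, 4), (1, 5, 5), (1, 3, 3),
              (2, 4, 5), (2, 5, 4), (2, 3, 3)]
    print(master_rater_correlation(sample))
```

With only a handful of papers per rater, as in point (3), such correlations
will be very unstable.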
Any suggestions are welcome.
Thanks!
Zhuan XU
American College Testing