Just wanted to share some evaluation runs I did that I found sort of
interesting...
I was able to reproduce my score on the supervised evaluation...Just
for the record I created the "mapping" like this...
 perl create_supervised_keyfile.pl -O SuperKey
../keys/senseinduction.key ../keys/senseinduction_train.key >
superanswers.txt
and then I evaluated my answer file (../Submitted/...) like this...
sup_eval.sh ../Submitted/12.new.answers.txt SuperKey
The results correspond with what you provided, which makes me think I
managed to do it correctly... I won't show the numbers here to keep
things anonymous...
Then I decided it would be interesting to see the result of evaluating
the official gold standard key file with the supervised
methodology....I had expected to see precision and recall of 100%, and
it is nearly that but slightly off....
marimba(270): sup_eval.sh ../keys/senseinduction.key SuperKey
All words
Fine-grained score for "SuperKey/senseinduction.all.suppervised.key"
using key "../keys/senseinduction_test.key":
 precision: 0.997 (4837.00 correct of 4851.00 attempted)
 recall: 0.997 (4837.00 correct of 4851.00 in total)
 attempted: 100.00 % (4851.00 attempted of 4851.00 in total)
Nouns
Fine-grained score for "SuperKey/senseinduction.noun.suppervised.key"
using key "../keys/senseinduction_test.key":
 precision: 0.998 (2553.00 correct of 2559.00 attempted)
 recall: 0.526 (2553.00 correct of 4851.00 in total)
 attempted: 52.75 % (2559.00 attempted of 4851.00 in total)
Verbs
Fine-grained score for "SuperKey/senseinduction.verb.suppervised.key"
using key "../keys/senseinduction_test.key":
 precision: 0.997 (2284.00 correct of 2292.00 attempted)
 recall: 0.471 (2284.00 correct of 4851.00 in total)
 attempted: 47.25 % (2292.00 attempted of 4851.00 in total)
Is there any reason why what I attempted would not result in 100%
precision and recall?
There are apparently 14 instances (6 nouns and 8 verbs) that have gone
missing in the supervised evaluation? Obviously this is a very very
small number of instances, but it wasn't immediately clear to me why
the supervised evaluation on the gold standard key wouldn't result in
100%, so I thought it worth mentioning.
Now, I had noticed that superanswers.txt (saved above from standard
output) had 4851 instances in it, so I decided to run scorer2 on that
data relative to the gold standard test key.
When I did that, I got the same precision/recall as I got from the
sup_eval.sh script, although the number of attempted instances was 4851,
which suggests that there were a few errors in the key file created by
create_supervised_keyfile.pl that then looked like (maybe?) missing
instances in the sup_eval.sh script.... ??
marimba(295): scorer2 ../keys/senseinduction_test.key superanswers.txt
Fine-grained score for "../keys/senseinduction_test.key" using key
"superanswers.txt":
 precision: 0.997 (4837.00 correct of 4851.00 attempted)
 recall: 0.997 (4837.00 correct of 4851.00 in total)
 attempted: 100.00 % (4851.00 attempted of 4851.00 in total)
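As a quick sanity check on those counts (and assuming each line of a key
file holds exactly one instance), a plain line count should show 4851 for
both files:

 wc -l superanswers.txt ../keys/senseinduction_test.key   # both should report 4851 lines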
So, I guess the question might be whether create_supervised_keyfile.pl
would make "mistakes" in creating the key file, and whether those mistakes
would show up as missing instances in the sup_eval.sh process? Does that
tell us anything about the data, evaluation, etc. that is of interest?
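One rough way to check whether any instances are genuinely missing, as
opposed to just being mapped to the wrong sense, is to compare instance ids
directly (this assumes the instance id is the second whitespace-separated
field of each key line, as in the key lines shown below; mapped.ids and
gold.ids are just scratch files):

 awk '{print $2}' superanswers.txt | sort > mapped.ids                # ids with a mapped answer
 awk '{print $2}' ../keys/senseinduction_test.key | sort > gold.ids   # ids in the gold key
 comm -13 mapped.ids gold.ids                                         # gold ids with no mapped answer

If that prints nothing, all 4851 instances are answered, and the 14
differences are wrong labels rather than missing instances.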
Thanks,
Ted
-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
marimba(310): diff superanswers.txt ../keys/senseinduction_test.key
615,620c615,620
< carrier.n carrier.n.117 carrier.n.10
< carrier.n carrier.n.118 carrier.n.10
< carrier.n carrier.n.119 carrier.n.10
< carrier.n carrier.n.120 carrier.n.10
< carrier.n carrier.n.121 carrier.n.10
< carrier.n carrier.n.122 carrier.n.10
---
> carrier.n carrier.n.117 carrier.n.5
> carrier.n carrier.n.118 carrier.n.5
> carrier.n carrier.n.119 carrier.n.5
> carrier.n carrier.n.120 carrier.n.5
> carrier.n carrier.n.121 carrier.n.5
> carrier.n carrier.n.122 carrier.n.5
915c915
< disclose.v disclose.v.69 disclose.v.1
---
> disclose.v disclose.v.69 disclose.v.2
1526c1526
< hold.v hold.v.145 hold.v.1
---
> hold.v hold.v.145 hold.v.7
1747c1747
< keep.v keep.v.319 keep.v.1
---
> keep.v keep.v.319 keep.v.7
2058c2058
< occur.v occur.v.60 occur.v.1
---
> occur.v occur.v.60 occur.v.3
2886,2887c2886,2887
< produce.v produce.v.151 produce.v.1
< produce.v produce.v.152 produce.v.1
---
> produce.v produce.v.151 produce.v.3
> produce.v produce.v.152 produce.v.3
2963c2963
< raise.v raise.v.157 raise.v.1
---
> raise.v raise.v.157 raise.v.9
4825c4825
< work.v work.v.247 work.v.1
---
> work.v work.v.247 work.v.6
> I was able to reproduce my score on the supervised evaluation...Just
> for the record I created the "mapping" like this...
> 
>  perl create_supervised_keyfile.pl -O SuperKey ../keys/senseinduction.key
>  ../keys/senseinduction_train.key > superanswers.txt
Actually, you must pass your clustering solution as the first parameter to
create_supervised_keyfile.pl, i.e.,
perl create_supervised_keyfile.pl ../Submitted/12.new ../keys/senseinduction_train.key > superanswers.txt
The -O switch of the create_supervised_keyfile.pl script indicates a directory
for writing the mapping matrices between clusters and senses (one file per word
with a .c2s extension), and is used just for debugging purposes.
Once the supervised keyfile is created (superanswers.txt), you can evaluate it with:
./scorer2 superanswers.txt ../keys/senseinduction_test.key
> Then I decided it would be interesting to see the result of evaluating the
> official gold standard key file with the supervised methodology....I had
> expected to see precision and recall of 100%, and it is nearly that but
> slightly off....
>
> [...]
> 
> Is there any reason why what I attempted would not result in 100% precision
> and recall?  There are apparently 14 instances (6 nouns and 8 verbs) that
> have gone missing in the supervised evaluation? Obviously this is a very
> very small number of instances, but it wasn't immediately clear to me why
> the supervised evaluation on the gold standard key wouldn't result in
> 100%, so I thought it worth mentioning.
There is always some information loss when a mapping step is required. In
this case the differences can be explained by the fact that some sense tags
appear in the test corpus but not in the training part (for example, the
carrier.n.5 sense). Since we use the train corpus to map between clusters
and senses, the information about those senses that do not appear in the
train corpus is lost.
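For instance, a rough check along these lines (assuming single-sense key
lines in the "lemma instance sense" format shown in the diff above, with the
sense tag in the third field) should print nothing for the train key but a
handful of lines for the test key, which is why no cluster can ever be
mapped to carrier.n.5:

 awk '$3 == "carrier.n.5"' ../keys/senseinduction_train.key   # expected: no output
 awk '$3 == "carrier.n.5"' ../keys/senseinduction_test.key    # the test instances that cannot be recovered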
 
Best,
				aitor