observation on unsupervised purity and supervised result


dulu...@gmail.com

Apr 16, 2007, 12:39:47 PM
to senseinduction
Greetings all,

I've noticed that the purity values for my unsupervised results
track rather closely with the supervised results. They are within .01
for All, Nouns, and Verbs, and I'm wondering if that is a general
characteristic of the supervised measure. This was based on computing
purity on the test portion of the data, btw.
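As a sanity check on what "purity" means in this discussion, here is a minimal sketch of the usual definition (my own helper, not the task's official scorer): each cluster is credited with its majority gold class, and the majority counts are summed and divided by the number of instances.

```python
from collections import Counter

def purity(clusters):
    """Purity of a clustering: each cluster votes for its majority gold
    class; sum the majority counts and divide by the total instance count.
    `clusters` maps a cluster id to the gold-class labels of its members."""
    total = sum(len(labels) for labels in clusters.values())
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in clusters.values())
    return majority / total

# Two clusters over 6 instances with gold classes "a"/"b":
print(purity({0: ["a", "a", "b"], 1: ["b", "b", "a"]}))  # 4/6 ≈ 0.667
```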

Will the purity and entropy values for all systems be made available
at some point as well?

BTW, I ran the unsupervised evaluation using all 27,132 instances and
the overall F-score dropped a few points, perhaps .04. I noticed as
well that the average number of classes shifted rather dramatically, from 3.68
(using all 27,132 instances) to 2.87 when using just the test portion.
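For reference, the unsupervised F-score being discussed can be sketched roughly as follows. This is a hypothetical re-implementation of the standard clustering F-score, not the official task scorer: each gold class is matched to the cluster that maximizes its F1, and the best F1 values are weighted by class size.

```python
from collections import Counter

def clustering_fscore(cluster_ids, gold_labels):
    """Sketch of a clustering F-score: for each gold class, take the best
    F1 over all clusters, then weight by the class's share of instances.
    `cluster_ids` and `gold_labels` are parallel lists, one entry per instance."""
    n = len(gold_labels)
    cluster_sizes = Counter(cluster_ids)
    class_sizes = Counter(gold_labels)
    joint = Counter(zip(gold_labels, cluster_ids))  # (class, cluster) counts
    score = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint[(c, k)]
            if n_ck:
                p, r = n_ck / n_k, n_ck / n_c  # precision, recall
                best = max(best, 2 * p * r / (p + r))
        score += (n_c / n) * best
    return score

# A perfect clustering scores 1.0; merging everything scores less:
print(clustering_fscore([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0
```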

Cordially,
Ted

Aitor Soroa Etxabe

Apr 17, 2007, 4:04:19 AM
to sensein...@googlegroups.com
Hi,

On 2007/04/16, tped...@d.umn.edu wrote:
> [...]
>
> Will the purity and entropy values for all systems be made available
> at some point as well?

Yes, we plan to include the purity and entropy values in the task description
paper.


Best,
aitor

dulu...@gmail.com

Apr 19, 2007, 12:21:38 PM
to senseinduction
Hi Aitor,

Would it be possible to make available an anonymized listing of the
F-score, purity, and entropy values for all systems for the
unsupervised evaluation on the test data as well as on train+test?

Both of the above would allow for adding a few analysis points in the
final version of our system paper.

Thanks,
Ted


Aitor Soroa Etxabe

Apr 20, 2007, 4:23:51 AM
to sensein...@googlegroups.com
Hi Ted,

Here is a table with the unsupervised evaluation results (anonymized):

\begin{table*}[ht]
\centering
\begin{tabular}{|l|c|ccc|cc|}
\hline \hline
System & Rank & \multicolumn{3}{c|}{All} & Nouns & Verbs\\
& & Fscore & Purity & Entropy & Fscore & Fscore\\
\hline
1clusterPerWord & 1 & \textbf{78.9} & 79.8 & 45.4 & 80.7 & 76.8 \\
-- & 2 & \textbf{78.7} & 80.5 & 43.8 & 80.8 & 76.3 \\
-- & 3 & \textbf{66.3} & 83.8 & 33.2 & 69.9 & 62.2 \\
-- & 4 & \textbf{66.1} & 81.7 & 40.5 & 67.1 & 65.0 \\
-- & 5 & \textbf{63.9} & 84.0 & 32.8 & 68.0 & 59.3 \\
-- & 6 & \textbf{61.5} & 82.2 & 37.8 & 62.3 & 60.5 \\
-- & 7 & \textbf{56.1} & 86.1 & 27.1 & 65.8 & 45.1\\
Random & 8 & \textbf{37.9} & 86.1 & 27.7 & 38.08 & 37.66\\
1clusterPerInst & 9 & \textbf{9.5} & 100 & 0 & 6.6 & 12.7 \\
\hline \hline
\end{tabular}
\caption{Unsupervised evaluation on the test corpus (Fscore).}
\label{tab:unsup-eval}
\end{table*}

The entropy and purity values of the clustering solutions over the test
corpus are also included, but not further split into nouns/verbs.
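For concreteness, the entropy column (lower is better) can be sketched as the size-weighted average of each cluster's entropy over the gold classes. This is my own sketch of the standard definition in bits; the table's 0-100 scale suggests the official scores are additionally normalized, which I am not reproducing here.

```python
import math
from collections import Counter

def clustering_entropy(clusters):
    """Size-weighted average per-cluster entropy over gold classes, in bits.
    A cluster containing a single gold class contributes 0 (lower is better).
    `clusters` maps a cluster id to the gold-class labels of its members."""
    total = sum(len(labels) for labels in clusters.values())
    h = 0.0
    for labels in clusters.values():
        n = len(labels)
        h_c = -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
        h += (n / total) * h_c
    return h

print(clustering_entropy({0: ["a", "a"], 1: ["b"]}))  # 0.0: every cluster is pure
```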

Best,
aitor

--
ondo izan
aitor
Pgp id: 0x5D6070F2

Ted Pedersen

Apr 20, 2007, 10:32:00 AM
to sensein...@googlegroups.com
Hi Aitor,

Thank you, this is really interesting. Do you think it would be
possible to provide the same sort of table for the test+train data
(27,132 instances)? I don't think the results will change that much,
but they will be a bit different, I think. One significant difference
between the test and train+test data is the average number of classes
("true" clusters), which goes up from 2.something to 3.something, and I
guess that could have some impact on all of these measures.

The thing that I can see from this table is that my earlier
speculations about some correlation between the supervised measure and
purity are probably not true. :) I will keep looking at that, but I
think that was just something that happened with my results and
perhaps not anyone else's (so maybe nobody understood what in the
world I was thinking ... :)

Among the more interesting things I see in the table below is random!!???

Fscore 37.9
Purity 86.1
Entropy 27.7

These numbers present themselves as the classic sort of argument for "well, my
system did not find the same senses as exist in the data (due to low
f-score), but
it found relatively clean clusters (high purity and low entropy) so
whatever it did
was probably pretty good."

Except that it's random. :)

How many clusters are in the random results? It would seem that if you
had a largish number of clusters, but not too many, you might get
numbers like this. For example, 1clusterPerInst gets a purity of 100
and an entropy of 0, which looks pretty good except that you have
4,000+ clusters.
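Ted's point is easy to reproduce: a degenerate one-cluster-per-instance solution maxes out purity and zeroes entropy while telling us nothing about the senses. A quick sketch, using my own helper built on the standard purity/entropy definitions:

```python
import math
from collections import Counter

def purity_entropy(clusters):
    """Return (purity, entropy-in-bits) for `clusters`, which maps a
    cluster id to the gold labels of its members. Purity: higher is
    better; entropy: lower is better."""
    total = sum(len(v) for v in clusters.values())
    pur = sum(Counter(v).most_common(1)[0][1] for v in clusters.values()) / total
    ent = 0.0
    for v in clusters.values():
        n = len(v)
        ent -= (n / total) * sum((c / n) * math.log2(c / n)
                                 for c in Counter(v).values())
    return pur, ent

gold = ["a", "b"] * 50                              # 100 instances, 2 gold classes
singletons = {i: [g] for i, g in enumerate(gold)}   # one cluster per instance
print(purity_entropy(singletons))  # (1.0, 0.0): "perfect" scores, useless clustering
```

The number of clusters is exactly the hidden variable these two measures ignore, which is why reporting it alongside them is so useful.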

I think it "might" be interesting to know the average number of
clusters that systems found. I reported that in our system paper, and
I'd encourage others to do the same, as it can add a useful data point
when trying to figure out some of these scores. And of course,
including that in a summary table in the task paper could be added to
my endless list of requests. :)

In any case, thanks again for the very interesting table, and sorry
for the additional
requests, etc. that come from that.

Cordially,
Ted


--
Ted Pedersen
http://www.d.umn.edu/~tpederse
