contributing to nltk.metrics.agreement

David Doukhan

Feb 17, 2012, 12:32:09 PM

to nltk...@googlegroups.com

Hi,

Here are some more details concerning my application to nltk-dev. Please
let me know if these points have already been discussed on the list.
I would like to use inter-annotator agreement measures, applied to
text segmentation tasks.
I have already used the kappa and the WindowDiff provided in this module,
and I would like to use two other metrics:

* Generalized Hamming Distance, which has been shown to be a robust
metric for text segmentation (see Bestgen 2009: Quel indice pour
mesurer l'efficacité en segmentation de textes? Sorry, it's in French).
A C++ implementation of this measure was provided by the original authors:
http://digital.cs.usu.edu/~vkulyukin/vkweb/software/ghd/ghd.html
I think this implementation could be wrapped quite easily.
* the Pk measure introduced by Beeferman in Statistical models for
text segmentation (1999).
While the WindowDiff measure provided in NLTK was shown to be a better
estimator, Pk was a standard for years, and knowing its value may
help to compare results with other work.
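
To make the Pk idea concrete, here is a minimal sketch, assuming both
segmentations are equal-length strings in which '1' marks a boundary
(the same input convention as windowdiff). The name pk_sketch, the
fallback for k, and the number of windows compared are my assumptions
for illustration; this is not an existing NLTK function or the
reference implementation.

```python
def pk_sketch(ref, hyp, k=None, boundary="1"):
    """Fraction of sliding windows where ref and hyp disagree on
    whether the window contains at least one boundary.

    ref, hyp: equal-length strings, `boundary` marks a segment break.
    k: window width; None means roughly half the average reference
       segment size (an assumed default, estimated from the boundary count).
    """
    if len(ref) != len(hyp):
        raise ValueError("segmentations must have the same length")
    if k is None:
        n_boundaries = ref.count(boundary)
        if n_boundaries == 0:
            k = max(1, len(ref) // 2)
        else:
            k = max(1, round(len(ref) / (2 * n_boundaries)))
    n_windows = len(ref) - k + 1  # assumption: every full window is compared
    if k < 1 or n_windows < 1:
        raise ValueError("k must be between 1 and the sequence length")

    errors = 0
    for i in range(n_windows):
        ref_has_boundary = boundary in ref[i:i + k]
        hyp_has_boundary = boundary in hyp[i:i + k]
        if ref_has_boundary != hyp_has_boundary:
            errors += 1
    return errors / n_windows
```

Unlike WindowDiff, this only checks whether a window contains a
boundary at all, not how many, so it cannot see some near misses and
extra boundaries that WindowDiff penalizes.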

Since I need these tools for my own work, I'm proposing to add them to
NLTK, so that people working on similar problems have access to
these procedures without having to re-implement them.
So let me know your point of view on these points. I would also
appreciate any pointers to existing implementations of the Pk measure I
could use as a reference or wrap. While this measure has been widely used
in the literature, I did not find any public implementation of it.

Regards,

--
David Doukhan

Joel Nothman

Feb 18, 2012, 6:30:43 AM

to nltk...@googlegroups.com, David Doukhan

Hi David,

I don't know about the Pk metric (or segmentation), but it seems to be an
evaluation metric, not an inter-annotator agreement metric. The metrics
provided in nltk.metrics.agreement are chance-corrected agreement metrics:
they aim to discount basic agreement (or another metric for comparing
annotations) by the portion that can be accounted for by chance. They are
only used when comparing human annotators.
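
As a rough illustration of what "chance-corrected" means here, a
minimal sketch of Cohen's kappa for two annotators follows. It is
written from the textbook formula (observed agreement minus expected
agreement, divided by one minus expected agreement); the function name
and input format are assumptions for illustration and this is not the
nltk.metrics.agreement code.

```python
from collections import Counter


def cohen_kappa_sketch(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance that both pick the same label if each
    # annotator labels at random according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

For example, with labels x, x, y, y from one annotator and x, y, y, y
from the other, raw agreement is 0.75 but chance agreement is 0.5, so
kappa is 0.5.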

Perhaps your contributions belong in nltk.metrics.scores, or even better,
in nltk.metrics.segmentation.

Thanks for helping out!

- Joel

Joel Nothman

Feb 22, 2012, 8:18:47 AM

to David Doukhan, nltk-dev

On Wed, 22 Feb 2012 23:46:02 +1100, David Doukhan
<david....@gmail.com> wrote:
> 2012/2/18 Joel Nothman <jnot...@student.usyd.edu.au>:

>> I don't know about the Pk metric (or segmentation), but it seems to be
>> an evaluation metric, not an inter-annotator agreement metric.
>

> That's true.
> However, the Pk, as well as the WindowDiff measure provided in NLTK,
> were considered in the survey article
> "Inter-Coder Agreement for Computational Linguistics" (Artstein,
> Poesio), 2008
> as an option for computing agreement (see end of section 4.3.1).

All evaluation metrics that compare two or more labellings of data can be
used as agreement metrics, and when such an evaluation metric is obvious
for an annotation task, it should be published alongside a chance-corrected
version. This allows people to get an idea of how feasible the task is
(because they can see how well humans could do it, when their agreement by
chance is discounted), but also what they might expect an automatic system
to achieve, represented by the standard evaluation metric.

One could say Artstein and Poesio's survey is separately interested in
comparing chance-corrected agreement metrics, and in the history of CL
scholarship caring about agreement metrics. Towards the latter purpose it
describes the ways people have reported agreement, chance-corrected or
otherwise.

The module name 'agreement' is perhaps misleading, and 'chance_corrected'
may be more apt; however, these metrics are used and associated
specifically with human inter-annotator agreement measurement (and papers
rarely refer to them as chance corrected, instead asserting the likes of
"we calculated agreement of κ=.70 which someone said was pretty good").

>> Perhaps your contributions belong in nltk.metrics.scores, or even
>> better, in nltk.metrics.segmentation.
>>

> I think nltk.metrics.segmentation is a good idea.
> I also think nltk.metrics.windowdiff should also be inside.

Yes, as far as I'm concerned, go for it. Make windowdiff part of
segmentation, and add other metrics. Hamming distance might also belong in
nltk.metrics.distance, however... I don't know about its generalisation.
One of the key points in implementation is that all metrics should have
the same input parameters as windowdiff if possible.

Similarly, we should probably move the Spearman correlation metric into a
module for ranking comparisons, including Kendall's Tau and Rank Biased
Overlap.

Others: does that seem appropriate? Do we need to ensure
backwards-compatibility?

- Joel

David Doukhan

Feb 23, 2012, 2:51:10 PM

to nltk...@googlegroups.com

2012/2/22 Joel Nothman <jnot...@student.usyd.edu.au>:

Pk and GHD have been implemented in separate source files for now
(so backward compatibility is preserved), and I've opened a pull request.
I may also suggest small changes to windowdiff that may break
backward compatibility:

* to match the way it was defined by Pevzner and Hearst, the
result returned by the current implementation should be divided by the
size of the sequence minus k. We could adapt the implementation to
this definition, or explain this in the doc and let users do it
themselves when using windowdiff.
* the authors recommend setting the value of k (window width) to half the
size of the reference segments. It may be useful to state this at least in
the doc. In the Pk implementation I did, k has a default value of
None, meaning half of the reference segment size. Having the same
behavior in windowdiff may make it easier to use.
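
To illustrate the two suggestions above, here is a minimal sketch of a
normalized WindowDiff with a k=None default. The function names are
hypothetical and this is not the code in the pull request. Two
assumptions are worth flagging: it compares len(ref) - k windows so
that the divisor matches "size of the sequence minus k", and it
estimates "half of the reference segment size" from the boundary count.

```python
def default_window(ref, boundary="1"):
    """Assumed default for k: about half the average reference segment size."""
    n_boundaries = ref.count(boundary)
    if n_boundaries == 0:
        return max(1, len(ref) // 2)
    return max(1, round(len(ref) / (2 * n_boundaries)))


def windowdiff_normalized(ref, hyp, k=None, boundary="1"):
    """Proportion of windows in which ref and hyp contain a different
    number of boundaries, normalized by len(ref) - k."""
    if len(ref) != len(hyp):
        raise ValueError("segmentations must have the same length")
    if k is None:
        k = default_window(ref, boundary)
    n_windows = len(ref) - k  # divisor taken from the description above
    if k < 1 or n_windows < 1:
        raise ValueError("k must be at least 1 and smaller than the sequence")

    mismatches = 0
    for i in range(n_windows):
        ref_count = ref[i:i + k].count(boundary)
        hyp_count = hyp[i:i + k].count(boundary)
        if ref_count != hyp_count:
            mismatches += 1
    return mismatches / n_windows
```

With normalization the result always lies between 0 and 1, so values
can be compared across sequences of different lengths.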

Regards,

--
David Doukhan

Steven Bird

Feb 23, 2012, 4:08:19 PM

to nltk...@googlegroups.com, David Doukhan

-Steven
(sent from my phone)

On Feb 22, 2012 11:18 PM, "Joel Nothman" <jnot...@student.usyd.edu.au> wrote:
> Do we need to ensure backwards-compatibility?

No, not until 2.0 is finalized. Let's agree on a consistent interface.

-Steven

David Doukhan

Apr 17, 2012, 9:26:12 AM

to nltk...@googlegroups.com

Hi,

I have not had any further news concerning my pull requests.

Morten Neergaard and Joel Nothman suggested some corrections and
improvements, which have been made.

Let me know if you're still interested in these contributions, or if
some points need to be discussed...

Regards,

2012/2/23 Steven Bird <steve...@gmail.com>:

--
David Doukhan

Joel Nothman

Apr 17, 2012, 10:17:59 AM

to nltk...@googlegroups.com, David Doukhan

IMO this should be pulled, but it's not within my powers to do so.

I still think there's some question about the scope of packages/modules,
but the current solution is acceptable.

~J

Morten Minde Neergaard

Apr 17, 2012, 12:12:15 PM

to nltk...@googlegroups.com

At 15:26, Tue 2012-04-17, David Doukhan wrote:
> Hi,

Hi again =)

> I did not have any more news concerning my pull requests.
>
> Morten Neergaard and Joel Nothman suggested some corrections and
> improvement that were done.

Sorry for disappearing from the discussion, but the deadline for my
master's is coming up and I'm a bit stressed out :)

> Let me know if you're still interested by these contributions, or if
> some points need to be discussed...

I wanted to resume a larger discussion on the imports in general (lazy
importers in more places, dropping wildcard imports altogether, reducing
the amount of imports to higher levels ...)

For now, I'm happy with the way things look, and will process the pull
request. I shouldn't have let my thoughts on refactoring the imports hold
the pull request back!

Smiles,
--
Morten Minde Neergaard
