Message from discussion Sentence segmentation, running unit tests

To: "Darren Govoni" <dar...@ontrenet.com>
Cc: nltk-dev@googlegroups.com
Subject: Re: [nltk-dev] Re: Sentence segmentation, running unit tests
Date: Sun, 25 Mar 2012 01:06:33 +1100
From: "Joel Nothman" <jnoth...@student.usyd.edu.au>

Okay. I've done enough individual Q&A on this.

I've just introduced PunktSentenceTokenizer.debug_decisions in a patch at  
https://github.com/jnothman/nltk/blob/master/nltk/tokenize/punkt.py

Given text, the method yields one dictionary per sentence boundary
decision, with data on how that decision was reached.

For example:

>>> import nltk.corpus
>>> from nltk.tokenize import punkt
>>> text = nltk.corpus.gutenberg.raw('austen-emma.txt')
>>> trainer = punkt.PunktTrainer(text)
>>> tokenizer = punkt.PunktSentenceTokenizer(trainer.get_params())
>>> decisions = tokenizer.debug_decisions(text)
>>> print punkt.format_debug_decision(decisions.next())
Text: 'her.\n\nShe' (at offset 288)
Sentence break? True (default decision)
Collocation? False
'her.':
     known abbreviation: False
     is initial: False
'she':
     known sentence starter: True
     orthographic heuristic suggests is a sentence starter? unknown
     orthographic contexts in training: set(['BEG-UC', 'UNK-UC', 'UNK-LC', 'MID-LC', 'BEG-LC', 'MID-UC'])

>>> print punkt.format_debug_decision(decisions.next())
Text: 'period.  Her' (at offset 476)
Sentence break? True (default decision)
Collocation? False
'period.':
     known abbreviation: False
     is initial: False
'her':
     known sentence starter: False
     orthographic heuristic suggests is a sentence starter? unknown
     orthographic contexts in training: set(['UNK-UC', 'UNK-LC', 'BEG-UC', 'MID-LC', 'MID-UC'])


If, for example, abbreviation detection seems too conservative, try
modifying the training parameters (ABBREV, IGNORE_ABBREV_PENALTY,
ABBREV_BACKOFF), each documented in the source. If collocations are a
problem, see COLLOCATION, INCLUDE_ALL_COLLOCS, INCLUDE_ABBREV_COLLOCS and
MIN_COLLOC_FREQ. And so on. Or, if you have the time and inclination, do an
incremental parameter search over a portion of text where you have marked
gold-standard sentence breaks.
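
A rough sketch of such a parameter search, assuming the parameters are
overridden as attributes on a PunktTrainer instance before training. The
helper names (train_tokenizer, boundary_scores, search) are made up for
illustration and are not part of NLTK; gold breaks are taken to be
character offsets at which sentences end:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_tokenizer(train_text, **overrides):
    """Train Punkt with selected trainer parameters overridden."""
    trainer = PunktTrainer()
    for name, value in overrides.items():
        # Only touch attributes the trainer already defines (e.g. ABBREV).
        if not hasattr(trainer, name):
            raise AttributeError("unknown PunktTrainer parameter: %s" % name)
        setattr(trainer, name, value)
    trainer.train(train_text)
    return PunktSentenceTokenizer(trainer.get_params())


def boundary_scores(gold_ends, predicted_ends):
    """Precision, recall and F1 over sets of sentence-end offsets."""
    gold, predicted = set(gold_ends), set(predicted_ends)
    correct = len(gold & predicted)
    precision = correct / float(len(predicted)) if predicted else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def search(train_text, dev_text, gold_ends, grid):
    """Score every parameter setting in `grid`, best F1 first."""
    results = []
    for overrides in grid:
        tokenizer = train_tokenizer(train_text, **overrides)
        # span_tokenize yields (start, end) offsets for each sentence.
        predicted = [end for _, end in tokenizer.span_tokenize(dev_text)]
        f1 = boundary_scores(gold_ends, predicted)[2]
        results.append((f1, overrides))
    return sorted(results, key=lambda item: item[0], reverse=True)
```

Calling search(train_text, dev_text, gold_ends, [{"ABBREV": 0.3},
{"ABBREV": 0.1}, {"INCLUDE_ALL_COLLOCS": True}]) would then rank those
three settings by F1 on the marked portion. Scoring on text held out from
tuning is advisable, since nothing here guards against overfitting.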

Or you can more directly inspect the model: look at  
trainer._params.abbrev_types, etc. Adjust the training parameters as  
necessary, or artificially add cases as you find them.
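
As a minimal sketch of inspecting the model and adding a case by hand:
the training text and sample sentence below are placeholders, and
get_params() is used in place of reaching into trainer._params directly.
Abbreviation types are stored lower-cased and without the final period:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Placeholder training text; in practice this would be your own corpus.
train_text = "The meeting ran long. Everyone was tired. We left at noon. " * 5

trainer = PunktTrainer(train_text)
params = trainer.get_params()  # finalizes training, returns learned parameters

# Inspect what the unsupervised training learned.
print(sorted(params.abbrev_types))
print(sorted(params.collocations))
print(sorted(params.sent_starters))

sample = "Dr. Smith arrived. He sat down."
before = PunktSentenceTokenizer(params).tokenize(sample)

# Artificially add an abbreviation the training text never contained.
params.abbrev_types.add("dr")
after = PunktSentenceTokenizer(params).tokenize(sample)

print(before)
print(after)
```

With "dr" added, the tokenizer should no longer break after "Dr." in the
sample. The same approach applies to params.collocations and
params.sent_starters, though hand-added entries bypass the statistics the
trainer would normally require.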

Good luck!

- Joel

On Sat, 24 Mar 2012 21:40:07 +1100, Darren Govoni <dar...@ontrenet.com>  
wrote:

> Hi Joel,
>     Thanks for the suggestion. A quick run of the sbd code produces a good
> many more valid sentences from my data (where the sentences are not
> necessarily cleanly represented).
>
> The multilingual aspects of Punkt however are truly useful and I will
> find a combination of these a good strategy.
>
> Are the other parameters available for Punkt documented? I will try them
> out as well.
>
> Best,
> Darren
>
> On Sat, 2012-03-24 at 20:30 +1100, Joel Nothman wrote:
>> Hi Darren,
>>
>> Amber's suggestions only modify the application of orthographic rules
>> following an initial. Does this constitute a substantial proportion of
>> your errors?
>>
>> If so, perhaps we should make different orthographic heuristic modes for
>> more lenient situations. In any case there are various parameters that can
>> be modified, but only one setting has been selected by the Punkt authors
>> for cross-corpus performance, and we do not yet know anything about how
>> well proposed changes generalise.
>>
>> Punkt was tuned for numerous corpora in a variety of languages and text
>> capitalisations. If you're not in need of Punkt's domain and language
>> flexibility, and just want an English-language corpus segmented, why not
>> use a high-performance supervised system like
>> http://code.google.com/p/splitta/?
>>
>> Cheers,
>>
>> - Joel
>>
>> On Sat, 24 Mar 2012 10:32:00 +1100, Darren Govoni <dar...@ontrenet.com>
>> wrote:
>>
>> > Ok. Thanks for the heads up. I'm happy to compare new algorithms for
>> > this when they become available. I have tons of data where the
>> > segmentation is a little off in the current code.
>> >
>> > On Sat, 2012-03-24 at 08:25 +1100, Steven Bird wrote:
>> >> None of this discussion has (yet) resulted in changes to the official
>> >> version of punkt.py.
>> >>
>> >> If there are modifications that anyone wants to share, please just post
>> >> them on the issue tracker -- i.e. open a new issue and attach your
>> >> version.
>> >>
>> >> Thanks, -Steven
>> >>
>> >> On 23 March 2012 23:36, xinfan meng <mxf3...@gmail.com> wrote:
>> >> > I think you can clone the source code and run the routines.
>> >> >
>> >> >
>> >> > On Fri, Mar 23, 2012 at 8:34 PM, Darren Govoni <dar...@ontrenet.com> wrote:
>> >> >>
>> >> >> Hi,
>> >> >>  Is there a way we NLTK users can try the latest sentence
>> >> >> segmentation routines and see how they perform?
>> >> >>
>> >> >> thanks.
>> >> >>
>> >>
>> >
>>
>