Sentence segmentation, running unit tests


L. Amber Wilcox-O'Hearn

Dec 30, 2011, 3:40:10 PM
to nltk...@googlegroups.com
Hi!

I'm interested in many aspects of nltk -- It looks like a great
project! -- but this is my first attempt to use it. I encountered a
series of problems in connection with a simple task, and I'm hoping
that someone can help me either figure out how to get the
functionality I want with existing code, or help me to be able to
contribute a patch by showing me how the testing works.

The task I was working on is fairly simple -- I want to pre-process a
corpus for language modelling. I have text in paragraph form, and I
want to segment into sentences and then tokenize such that all words
and all punctuation are separate, but kept. It's the first part that
is giving me a problem. Here is the case:

In [1]: import nltk

In [2]: sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In [3]: text = 'Sentences may have "quoted parts at the end."\nAnd it
should still work correctly if the sentence ends with the U.S.\n"We
want to make sure that abbreviations like Mr. Wilcox-O\'Hearn and the
U.S. are treated correctly, too."\n'

In [4]: sent_detector.tokenize(text.strip(), realign_boundaries=True)
Out[4]:
['Sentences may have "quoted parts at the end."',
'And it should still work correctly if the sentence ends with the
U.S.\n"We want to make sure that abbreviations like Mr.
Wilcox-O\'Hearn and the U.S. are treated correctly, too."']

Incidentally, I only figured out the trick about using
realign_boundaries by looking at the source code. If it's in the
docs, I didn't see it.
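
For readers who have not met the flag: realign_boundaries is meant to move closing punctuation (such as a quotation mark) that would otherwise start the next sentence back onto the sentence it closes. Below is a rough stand-alone sketch of that idea, not NLTK's implementation; the function name realign and the regular expression are illustrative assumptions.

```python
import re

# Closing quotes/brackets at the very start of a sentence, plus trailing
# whitespace. This pattern is an illustrative assumption, not NLTK's own.
_CLOSERS = re.compile(r'^["\')\]}]+(?:\s+|$)')

def realign(sentences):
    """Move leading closing punctuation back onto the previous sentence."""
    result = []
    for sent in sentences:
        # Only realign when there is a previous sentence to attach to.
        match = _CLOSERS.match(sent) if result else None
        if match:
            result[-1] += match.group(0).strip()
            sent = sent[match.end():]
        if sent:
            result.append(sent)
    return result

print(realign(['Sentences may have "quoted parts at the end.', '" And so on.']))
```

An opening quotation mark directly followed by a word (as in '"We want ...') is left alone, because the pattern requires whitespace or end of string after the punctuation.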

The problem I haven't solved is that the sentence boundary between the
next two sentences isn't recognized.

Here are the steps I took to try to resolve it:

(1) I wanted to see if the source seemed to address the case of a
known abbreviation followed by a capitalized word that is known to
start sentences. I found code that looks like it does that in
nltk/tokenize/punkt.py

---
# [4.2. Token-Based Reclassification of Abbreviations] If
# the token is an abbreviation or an ellipsis, then decide
# whether we should *also* classify it as a sentbreak.
if ( (aug_tok1.abbr or aug_tok1.ellipsis) and
     (not tok_is_initial) ):
    # [4.1.1. Orthographic Heuristic] Check if there's
    # orthographic evidence about whether the next word
    # starts a sentence or not.
    is_sent_starter = self._ortho_heuristic(aug_tok2)
    if is_sent_starter == True:
        aug_tok1.sentbreak = True
        return

    # [4.1.3. Frequent Sentence Starter Heuristic] If the
    # next word is capitalized, and is a member of the
    # frequent-sentence-starters list, then label tok as a
    # sentence break.
    if ( aug_tok2.first_upper and
         next_typ in self._params.sent_starters):
        aug_tok1.sentbreak = True
        return
---

Since 'u.s' is in _params.abbrev_types, it looks like this should work!

(2) I tried to find the appropriate unit test to see if there is a
similar test case that passes.

Here I ran into a lot of problems.

First, I looked in the test directory. The only test that seems to
deal with sentence segmentation is tokenize.doctest, and that only
indirectly. I didn't find any test that actually calls the code I'm
running.

So then I thought I'd run all the unit tests to make sure I wasn't
missing any, and to make sure that if I added or changed code, I
didn't break anything.

Here I got confused.

I tried using doctest_driver, as instructed here:
https://github.com/nltk/nltk/blob/c2f9c49c2f2f4fe3fc513941e840831e2878925e/nltk/test/__init__.py
But doctest_driver.py requires a test file as an argument, and I
didn't see how to use it to run all the tests as a suite.

I looked at all.py, which claims that setup.py uses it, but setup.py
only runs simple.py.

Moreover, setup.py test fails for me, because of precision error:

---
[amber@ubuntu ~/nltk]$ sudo python setup.py test
running test
...
File "/home/amber/nltk/nltk/test/simple.doctest", line 33, in simple.doctest
Failed example:
recall(reference_set, test_set)
Expected:
0.80000000000000004
Got:
0.8
----------------------------------------------------------------------
File "/home/amber/nltk/nltk/test/simple.doctest", line 35, in simple.doctest
Failed example:
f_measure(reference_set, test_set)
Expected:
0.88888888888888884
Got:
0.8888888888888888


----------------------------------------------------------------------
Ran 1 test in 0.038s

FAILED (failures=1)
---
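
The two failures above look like float-formatting differences rather than wrong results: Python 2.6 and earlier printed 0.8 as 0.80000000000000004, whereas later versions print the shortest representation that round-trips. A minimal sketch of one way to make such a doctest independent of the interpreter's float repr is to format the value explicitly. The recall function here is a stand-in defined for illustration, not NLTK's own, and the helper name failures is hypothetical.

```python
import doctest

def recall(reference, test):
    """Fraction of reference items that also occur in test (stand-in)."""
    return len(reference & test) / float(len(reference))

# Brittle: the expected output depends on how the interpreter prints floats.
BRITTLE = """
>>> recall(set('abcde'), set('abcd'))
0.80000000000000004
"""

# Robust: format to a fixed number of decimal places before comparing.
ROBUST = """
>>> print('%.4f' % recall(set('abcde'), set('abcd')))
0.8000
"""

def failures(docstring):
    """Run one doctest string and return the number of failed examples."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(docstring, {'recall': recall}, 'example', None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts are of interest here.
    return runner.run(test, out=lambda text: None).failed

print(failures(BRITTLE), failures(ROBUST))
```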

I also tried running testrunner.py. This gives multiple failures in
tests that presumably should pass. Many seem to stem from the fact
that tree.doctest doesn't import Tree before using it. So I concluded
that this must not be the way you usually run tests.

I tried searching throughout the google code pages, and every github
repository, for references to doctest, but without success. I was
fooled by the fact that your book is runnable (which is extremely
cool!) and thought the first chapter might refer to it, but it
didn't.


So I'd like to know
(1) What your usual method of running tests is, so if I want to
contribute patches, I can make sure that they both work properly and
don't break existing code.
(2) Whether you have a test for the example I'm worried about.
(3) Whether the existing code can do the example I'm stuck on.

Thank you so much!
Amber

Amber

Dec 30, 2011, 7:03:46 PM
to nltk-dev
Just a quick follow-up.

I traced through a related example, where I want a sentence break to
occur after a token "U.S." and before a token "We" (i.e., I removed
the issue of the quotation mark between them).

It turns out that the reason no break is detected is that "We" is not
in the list of sentence starters, and it fails the ortho_heuristic.
That means it either never occurs lower case in the training corpus,
or occurred capitalized mid-sentence somewhere, right?

If the former, then the training corpus must be either too small or
else a very poor match for my use case.

I'm more inclined to think it is the latter though. I grepped my
corpus for " We ", and there are lots of those, for example, in the
sentence 'A second collection of nursery rhymes, "Now We Are Six", was
published in 1927.' or 'Two verses later we read: "And remember,
Children of Israel, when We made a covenant with you...'

I think this is a flaw in the criteria for sentence starters (unless
I'm misunderstanding). Is there a way around it?

Jan Strunk

Dec 31, 2011, 9:49:07 AM
to nltk...@googlegroups.com
Hello Amber,

I'm one of the original developers of the Punkt sentence boundary
detection system, though I did not port it to NLTK myself.
You're right that the reason why Punkt does not recognize the sentence
boundary in "[...] U.S. We [...]" is that it is somewhat cautious in
assuming sentence boundaries after abbreviations and one of its
strengths is the robustness with regard to orthographic cues, that is,
it can also deal with all uppercase or all lowercase text. As "We" also
occurred uppercase inside sentences in its training corpus and
apparently was not recognized as a frequent sentence starter, it is not
regarded as a reliable cue for a preceding sentence boundary.

This sentence for example occurred in the training corpus:
"It's a typical Carolina team: We run, we're unselfish, and we play
tough defense," Smith says.

(Of course one could discuss whether the colon should be regarded
as a sentence boundary marker here...)

However, for most practical uses, this behavior might be a little
too cautious. Unfortunately, as the NLTK version of Punkt is basically
a direct reimplementation of our "research prototype" plus some extra
features like the "realignment flag", it is not easy to change this
behavior using parameters. I agree that it would be important to
make the NLTK implementation of Punkt more useful for practical purposes
by allowing the tweaking of its behavior. Unfortunately,
I don't have any time for this at the moment, but I'm hoping to
be able to work a little bit on such problems later.

One more general note about Punkt. It is originally intended to be
trained on the corpus it is supposed to segment into sentences
using unsupervised learning. So you could try to train Punkt
directly on your own corpus in order to see whether the results
get better. The ideal way of using Punkt (which some students
of mine have been working on but which is not implemented in NLTK
yet) would be to combine unsupervised learning on the corpus
that is to be segmented itself and a pretrained model if the language
and genre of the text is known beforehand.

As a workaround for your problems, I would suggest postprocessing
the output of the Punkt sentence boundary detector e.g. by inserting
a sentence boundary after an abbreviation followed by uppercase
word that also occurs in a lowercase version in the corpus or some
heuristic like that.
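
A minimal sketch of that kind of post-processing, assuming a set of known abbreviations (stored lowercase without the final period, as Punkt does with 'u.s') and a set of words attested in lowercase in the corpus. The function name split_after_abbrev and both sets are hypothetical, and whitespace tokenization is a simplification.

```python
def split_after_abbrev(sentences, abbreviations, lowercase_vocab):
    """Split a sentence after an abbreviation when the next word is
    capitalized but also attested in lowercase in the corpus."""
    result = []
    for sent in sentences:
        words = sent.split()
        start = 0
        for i in range(len(words) - 1):
            # Ignore an opening quote or bracket on the following word.
            nxt = words[i + 1].strip('"\'(')
            if (words[i].endswith('.')
                    and words[i].lower().rstrip('.') in abbreviations
                    and nxt[:1].isupper()
                    and nxt.lower() in lowercase_vocab):
                result.append(' '.join(words[start:i + 1]))
                start = i + 1
        result.append(' '.join(words[start:]))
    return result

abbrevs = {'u.s', 'mr'}
lower_seen = {'we', 'the', 'and'}
print(split_after_abbrev(
    ['It ends with the U.S. We want Mr. Smith and the U.S. to agree.'],
    abbrevs, lower_seen))
```

Here "Mr. Smith" stays together because "smith" is not in the lowercase vocabulary, while "U.S. We" is split. Note that rejoining on single spaces normalizes the original whitespace.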

This kind of behavior could also be implemented as an alternative
less-cautious orthographic heuristic in the NLTK implementation of
Punkt so that users could either choose the more cautious or the
less cautious orthographic heuristic.
I'll try to implement that once I find the time.


Best regards,

Jan Strunk
str...@linguistics.rub.de

Sprachwissenschaftliches Institut
Ruhr-Universität Bochum
Germany

L. Amber Wilcox-O'Hearn

Jan 3, 2012, 10:44:27 AM
to nltk...@googlegroups.com
Jan, thank you for your reply and suggestions.

I agree that it makes sense to train on the corpus I want to segment
-- that would at least help ensure that the existing abbreviations are
recognized accurately. However, it wouldn't help with the sentence
starting in "We" after an abbreviation, since my corpus has those
examples we talked about.

I could definitely try post-processing as you suggested, but it seems
like it would be fairly easy to pass an option as a flag analogous to
realign_boundaries. For example, something like:

sent_detector.tokenize(text.strip(), realign_boundaries=True,
relax_ortho_heuristic=True)

Then instead of post-processing, pass that flag to the call to
_ortho_heuristic, and replace:

# If the word is capitalized, occurs at least once with a
# lower case first letter, and never occurs with an upper case
# first letter sentence-internally, then it's a sentence starter.
if ( aug_tok.first_upper and
     (ortho_context & _ORTHO_LC) and
     not (ortho_context & _ORTHO_MID_UC) ):
    return True

with

# If the word is capitalized, occurs at least once with a
# lower case first letter, and (unless the heuristic is relaxed)
# never occurs with an upper case first letter
# sentence-internally, then it's a sentence starter.
if ( aug_tok.first_upper and
     (ortho_context & _ORTHO_LC) and
     (relax_ortho_heuristic or
      not (ortho_context & _ORTHO_MID_UC)) ):
    return True

The question, of course, is to what extent it would degrade
performance on other examples. The cases in danger, I think, are
words that are both generic words and proper nouns. If we had a list
of sentences that are prone to this error, that would be helpful, and
I could make a unit test to see if they are affected.

A slightly more sophisticated approach might be to let _ORTHO_MID_UC
be a count or proportion, and use a threshold. That would require
retraining, though, and a bit more work. I too, have a lot of other
priorities right now, so I probably won't try that right now.

Anyway, I think I will try passing a flag, and see if that works for
me. I'll let you know.

Thank you.
Amber

Joel Nothman

Jan 3, 2012, 8:11:36 PM
to nltk...@googlegroups.com, L. Amber Wilcox-O'Hearn

Hi Amber,

I'm responsible for the realign_boundaries flag, which I added when
rewriting the existing implementation. I am sorry it is not more
documented, and the comment in sentences_from_text() should be copied to
tokenize(). I also agree the module needs unit tests; the algorithm is not
trivial, and the implementation needs testing.

But unit tests won't find problems with the model's applicability to a
particular dataset. (Of course if our aim were to improve upon Kiss and
Strunk's research system, we could implement regression tests.)

Are you training on your target corpus? While I don't expect this to help
the ortho heuristic, it may put "We" in the sentence starters, which is an
easier win.

The orthographic heuristic is a little bit problematic for training
corpora that are not well edited, especially when large. The heuristic is
learnt with binary flags: if a single instance is seen in training it is
flagged. A probabilistic approach may be better, but requires more memory
in training (which is not one-off), and requires a new threshold parameter
to be evaluated. I first worked with Punkt trained on Wikipedia text,
which is large and variably edited, leaving the ortho heuristic rarely
effective. However as Jan points out, the orthographic heuristic is also
meant to be robust to corpora without capitalisation information.
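
To illustrate the trade-off, here is a small sketch of a count-based variant of the mid-sentence-uppercase check. Everything in it is hypothetical (the class name, the max_mid_uc_ratio threshold and its default value); it is meant only to show where the extra memory and the extra parameter come from.

```python
from collections import Counter

class OrthoCounts:
    """Per-type counts of orthographic contexts instead of binary flags."""

    def __init__(self, max_mid_uc_ratio=0.05):
        # New hyperparameter that would have to be tuned on held-out data.
        self.max_mid_uc_ratio = max_mid_uc_ratio
        self.total = Counter()    # all occurrences of each type
        self.mid_uc = Counter()   # capitalized, sentence-internal occurrences

    def observe(self, word, sentence_initial):
        """Record one occurrence of a word from the training text."""
        typ = word.lower()
        self.total[typ] += 1
        if word[:1].isupper() and not sentence_initial:
            self.mid_uc[typ] += 1

    def rarely_mid_uppercase(self, typ):
        """True if capitalized mid-sentence uses stay under the threshold."""
        if not self.total[typ]:
            return False
        ratio = self.mid_uc[typ] / float(self.total[typ])
        return ratio <= self.max_mid_uc_ratio

counts = OrthoCounts()
for _ in range(99):
    counts.observe('we', sentence_initial=False)
counts.observe('We', sentence_initial=False)   # e.g. inside a title
print(counts.rarely_mid_uppercase('we'))
```

With binary flags the single capitalized occurrence would be enough to rule "We" out; with counts it is 1 in 100 and falls under the assumed threshold.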

Good luck bending it to your needs!

Cheers,

~J

L. Amber Wilcox-O'Hearn

Jan 4, 2012, 10:47:01 AM
to Joel Nothman, nltk...@googlegroups.com
Hi Joel,

I love the realign_boundaries flag, and I'm glad I found it. I didn't
mean to criticize -- much of my own code is minimally documented -- I
just meant to let you know that I almost missed it. When I got
started, I followed the book examples. If that is a common scenario,
then maybe putting it in there would be a quick and effective form of
documentation.

I understand what you mean about unit tests not guaranteeing a match
with the dataset. I haven't retrained on my corpus, since I wanted to
first make sure it was going to do what I wanted. Incidentally, the
corpus I'm working on right now is also Wikipedia, which, in addition
to having that variability you mention, has many examples of
mid-sentence capitalized words because of titles. Having worked with
that, do you have any specific tips or tweaks?

I hadn't thought about the fact that using a threshold for a
probabilistic orthographic heuristic would imply another parameter to
be tuned, but I suppose it must. But now I don't understand how you
would tune a parameter without some supervised examples you are
evaluating on. Maybe I ought to look at Kiss and Strunk!

Amber

Joel Nothman

Jan 4, 2012, 5:51:57 PM
to L. Amber Wilcox-O'Hearn, nltk...@googlegroups.com

On Thu, 05 Jan 2012 02:47:01 +1100, L. Amber Wilcox-O'Hearn
<amber.wil...@gmail.com> wrote:

First, I shall summarise the algorithm's overall approach. There are four
sets of model parameters:
* Words known as abbreviations, i.e. words commonly followed by .
* Collocations where the first word ends in a period, such as "Dr. Who"
* Sentence starters, i.e. words that often follow sentence ends or begin
paragraphs.
* Orthographic contexts

The first three are inferred using a standard collocation association
metric (log-likelihood), as well as some other heuristics, e.g. make
something a more likely abbreviation if it has internal periods.

The orthographic contexts are inferred naively from all words seen in
training.


> Incidentally, the
> corpus I'm working on right now is also Wikipedia, which, in addition
> to having that variability you mention, has many examples of
> mid-sentence capitalized words because of titles. Having worked with
> that, do you have any specific tips or tweaks?

Yes, a few:

* Wikipedia has lots of weird cases, though rare, so the binary
  orthographic heuristic is a bit problematic. But as Jan said, the
  algorithm is designed to work okay when that information is not available.
* Wikipedia uses abbreviations very infrequently (it was never a print
  medium with constrained space), and inconsistently uses periods within
  acronyms and after abbreviations (like ie, eg, etc, Dr). Nonetheless, when
  they are used with periods, you still want to be able to identify them.
  Therefore, when training on Wikipedia:
    * Decrease the abbreviation threshold (e.g.: trainer.ABBREV = .15) so
      more are accepted.
    * Set IGNORE_ABBREV_PENALTY = True to switch off the penalty on words
      that appear here with a period, there without.
    * Use INCLUDE_ALL_COLLOCS = True. This is something I introduced and is
      vaguely described in the source, but you have to understand Kiss and
      Strunk to get it.
    * Manually vet the resulting set of abbreviations, or add known
      abbreviations.
* The algorithm wasn't ever evaluated with such a large training corpus as
  Wikipedia, so there are some things you can do to reduce model blowout,
  and memory consumption in training (which, mind you, could happily be
  mapreduced):
    * Call freq_threshold every now and then during training to remove
      infrequently-seen instances.
    * By default a collocation (X. Y) may be included simply on the basis of
      the association between "X." and "Y". However it may not be worth
      storing it in the model if "X. Y" appears only once in training, i.e.
      your model may get too big. So you can set a higher MIN_COLLOC_FREQ,
      in proportion to the amount of text you've trained on.
* Find a way to incorporate Punkt's parameters into an existing
  [semi-]supervised system, or vice-versa!
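
Put together, those settings might be applied roughly as follows. The attribute and method names (ABBREV, IGNORE_ABBREV_PENALTY, INCLUDE_ALL_COLLOCS, MIN_COLLOC_FREQ, train, freq_threshold, finalize_training, get_params) are those of nltk.tokenize.punkt.PunktTrainer. The MIN_COLLOC_FREQ value, the pruning interval and the helper names are illustrative assumptions, and NLTK is imported lazily so that the settings helper can be read and tried on its own.

```python
# Values marked "assumed" are illustrative, not recommendations from NLTK.
WIKIPEDIA_SETTINGS = {
    'ABBREV': 0.15,                 # lower threshold: accept more abbreviations
    'IGNORE_ABBREV_PENALTY': True,  # no penalty for period/no-period variation
    'INCLUDE_ALL_COLLOCS': True,
    'MIN_COLLOC_FREQ': 5,           # assumed; scale with the amount of text
}

def configure(trainer, settings=WIKIPEDIA_SETTINGS):
    """Copy each setting onto the trainer as an attribute."""
    for name, value in settings.items():
        setattr(trainer, name, value)
    return trainer

def train_punkt(chunks, prune_every=10):
    """Train on an iterable of text chunks, pruning rare items as we go."""
    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
    trainer = configure(PunktTrainer())
    for i, chunk in enumerate(chunks, 1):
        # Defer finalization so statistics accumulate across chunks.
        trainer.train(chunk, finalize=False)
        if i % prune_every == 0:
            trainer.freq_threshold()
    trainer.finalize_training()
    return PunktSentenceTokenizer(trainer.get_params())
```

The resulting abbreviation set can then be vetted by hand before the tokenizer is used.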

> I hadn't thought about the fact that using a threshold for a
> probabilistic orthographic heuristic would imply another parameter to
> be tuned, but I suppose it must. But now I don't understand how you
> would tune a parameter without some supervised examples you are
> evaluating on. Maybe I ought to look at Kiss and Strunk!

As an unsupervised learning approach, the parameters of the model are
learnt directly. However, there are hyperparameters, such as frequency or
association metric thresholds, which were determined by Kiss and Strunk by
evaluating on gold standard development data (repeatedly, searching over
parameter values). It's a similar story in other unsupervised learning
approaches, such as clustering, outlier detection, or LDA: you don't need
supervision to get the result, only to know whether your results are good
and to tune the algorithm.
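
A toy sketch of that tuning loop: the model itself is learnt without labels, but choosing a threshold needs a small gold-standard sample to score against. The segment callable, the candidate thresholds, and the representation of boundaries as character offsets are all assumptions made for illustration.

```python
def f_measure(gold, predicted):
    """Harmonic mean of precision and recall over two sets of boundaries."""
    correct = len(gold & predicted)
    if not correct:
        return 0.0
    precision = correct / float(len(predicted))
    recall = correct / float(len(gold))
    return 2 * precision * recall / (precision + recall)

def tune_threshold(segment, text, gold, candidates):
    """Return the candidate threshold whose output best matches gold."""
    return max(candidates, key=lambda t: f_measure(gold, segment(text, t)))

# Stand-in segmenter: a lower threshold proposes more boundaries.
def toy_segment(text, threshold):
    scores = {10: 0.9, 25: 0.4, 40: 0.2}   # offset -> made-up confidence
    return {offset for offset, score in scores.items() if score >= threshold}

print(tune_threshold(toy_segment, 'unused here', {10, 25}, [0.1, 0.3, 0.5]))
```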

- Joel

L. Amber Wilcox-O'Hearn

Jan 7, 2012, 1:05:27 PM
to Joel Nothman, nltk...@googlegroups.com
Thank you for the summarisation, and the specifics. That's very helpful!

-Amber

Darren Govoni

Jan 14, 2012, 7:48:05 AM
to nltk...@googlegroups.com
Just had a thought related to this. I use this feature quite extensively.

Anyway, I seem to notice that sentences with conjunctions are not broken
into separate sentences (which may be correct behavior AFAIK).

In my application, it results in the conjoined sentence being ignored by
my grammar matching rules, which expect a single sentence structure,
not two or more.

Would it make sense to enable the sentence boundary logic to respect
conjunctions (and, but, etc)?

Joel Nothman

Jan 14, 2012, 8:03:35 PM
to nltk...@googlegroups.com, Darren Govoni

Hi Darren,

I am not sure I understand your proposal correctly. Do you mean that for a
sentence like "John likes apples and Mary likes cheesecake.", the SBD
should return ["John likes apples", "and Mary likes cheesecake."]?

While this might be useful for some applications, conjunctions are really
hard to work with, even for full parsers, which require much more
complicated models (and processing overhead) than the average SBD system;
generally sentence boundary detection is a prerequisite to parsing. How
would you handle "John and Mary like apples and cheesecake"?

The thing about SBD is that a naive approach, splitting on all {".", "!",
"?"} is fairly successful, but not quite good enough. An SBD system's job
(for Roman languages) is to disambiguate the uses of '.' and '...', to
decide whether they are ending a sentence or not (by recognising
abbreviations and likely sentence beginnings/endings). Punkt is even naive
enough to assume that all '?' and '!' are sentence boundaries, and in most
cases it is right.
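
A quick sketch of the naive baseline, to show where it goes wrong. The regular expression is an illustrative assumption and not part of Punkt.

```python
import re

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(naive_split('Is it right? Mostly! But Mr. Smith visited the U.S. last year.'))
```

The '?' and '!' splits are fine, but the abbreviations "Mr." and "U.S." each produce a spurious boundary, which is exactly the ambiguity an SBD system has to resolve.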

Before getting to conjunctions, a much trickier problem is how to handle
':' and ';'. Sometimes (in English) these separate things that should be
treated by POS and Parsing models as complete sentences, and sometimes
they don't. The general convention is to assume they don't, but were they
handled better, there may possibly be marginal improvements to downstream
performance.

- Joel


L. Amber Wilcox-O'Hearn

Feb 22, 2012, 7:23:11 AM
to nltk...@googlegroups.com
Hello again.

Since last time I wrote, I finally got a chance to work on this
further. Here are the problems I ran into, and what I have done so
far to work around them.

First, the Wikipedia corpus is much, much too big to fit in my space
for training. Ideally, the training would be done streaming, though I
realise this would take a couple of passes. So for now, since I
happen to have my training set sliced into portions of 100,000 lines,
I've just decided to train and segment each one separately. Probably
this is enough data to segment well anyway.
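
Slicing a large corpus into fixed-size portions can be done lazily, so that only one portion is held in memory at a time. A minimal sketch, assuming line-oriented input; the function name line_chunks and its size parameter are hypothetical.

```python
import io
from itertools import islice

def line_chunks(lines, size=100000):
    """Yield successive strings of up to `size` lines from an iterable."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield ''.join(chunk)

# A file object opened for reading can be passed in the same way.
corpus = io.StringIO(u'one.\ntwo.\nthree.\nfour.\nfive.\n')
print([chunk.count('\n') for chunk in line_chunks(corpus, size=2)])
```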

I used the settings that Joel recommended, and they helped.

The biggest problem I had was in dealing with initials. It turns out
that almost every letter in the alphabet occurs in my data lower case
with a period somewhere, e.g.:

... logical reasons for doing so were: a. to cement...
... the amir of Algiers, Selim b. Teumi, invited...
...
...vol. v. pp. 86-117...
...
...first-order behavior near x. The cotangent...

That's just the nature of it.

So the algorithm as it stands essentially always splits sentences
between double initials when trained on my corpus.

The problem is in here:

# Special heuristic for initials: if orthographic
# heuristic is unknown, and next word is always
# capitalized, then mark as abbrev (eg: J. Bach).
if ( is_sent_starter == 'unknown' and tok_is_initial and
     aug_tok2.first_upper and
     not (self._params.ortho_context[next_typ] &
          _ORTHO_LC) ):  # <====
    aug_tok1.sentbreak = False
    aug_tok1.abbr = True
    return

The solution I had suggested before, of just ignoring that line,
creates a large proportion of false non-segments, such as "...the
generals of the Byzantine Emperor, Justinian I. The Byzantine Empire
then retained...".

The tweak that finally gets most of them without also getting false
non-segmentations is to replace the line as follows:

# Special heuristic for initials: if orthographic
# heuristic is unknown, and next word is always
# capitalized when occurring without a period,
# or is itself an initial, then mark as abbrev (eg: J. Bach).
if ( is_sent_starter == 'unknown' and tok_is_initial and
     aug_tok2.first_upper and
     # not (self._params.ortho_context[next_typ] & _ORTHO_LC) ):
     self._params.ortho_context[next_typ] < 46 or
     aug_tok2.is_initial):
    aug_tok1.sentbreak = False
    aug_tok1.abbr = True
    return


A quick manual check of the first couple hundred relevant lines using
this tweak recovers almost everything. Here's a sample of the diff, with
'-' for the original code and '+' for the tweak:

-Also M.
-E. Lazarus was an important American individualist anarchist who promoted free love.
+Also M. E. Lazarus was an important American individualist anarchist who promoted free love.
@@ -3452,2 +218 @@
-Experiments in Germany led to A.
-S. Neill founding what became Summerhill School in 1921.
+Experiments in Germany led to A. S. Neill founding what became Summerhill School in 1921.
@@ -4869,2 +1634 @@
-E.
-B. Tylor (2 October 1832 – 2 January 1917) and James George Frazer (1 January 1854 – 7 May 1941) are generally considered the antecedents to modern social anthropology in Britain.
+E. B. Tylor (2 October 1832 – 2 January 1917) and James George Frazer (1 January 1854 – 7 May 1941) are generally considered the antecedents to modern social anthropology in Britain.

A couple of problem sentences remain of the following type:

-When the aged W. E.
-B.
+When the aged W. E. B.

where the original sentence said "When the aged W. E. B. Du Bois...",
but I can live with this much error.


Incidentally, after training on the corpus, I don't actually get the
problem I anticipated with the sentence not breaking between "U.S."
and "We", so that saves me some post-processing.


Anyway, I thought I'd share all of that in case it's useful for anyone
else, or in case someone has further suggestions.

Thank you.
Amber
--
http://scholar.google.com/citations?user=15gGywMAAAAJ

Jan Strunk

Feb 27, 2012, 7:08:07 PM
to nltk...@googlegroups.com
Hello!

I think streaming during training is an excellent idea. I have actually
proposed the subject of incremental learning for sentence boundary
detection (using the Punkt algorithm) to some of my bachelor
students but so far no one has decided to work on this subject.

I think that portions of 100,000 lines are sufficient for a good
performance of the Punkt system. For our Computational Linguistics
article, we tested the system on corpora with an average size of
about 350,000 tokens.

The tweak you propose sounds interesting. I'll try to find some time
to test whether it leads to any degradations in performance on our
test corpora. If not, I would propose to integrate it into the NLTK
implementation of Punkt...

Could you remind us which orthographic contexts
self._params.ortho_context[next_typ] < 46 picks out again?

Best regards,

Jan
str...@linguistics.rub.de

> -B. Tylor (2 October 1832 – 2 January 1917) and James George Frazer (1
> January 1854 – 7 May 1941) are generally considered the antecedents to
> modern social anthropology in Britain.
> +E. B. Tylor (2 October 1832 – 2 January 1917) and James George Frazer
> (1 January 1854 – 7 May 1941) are generally considered the antecedents
Joel Nothman

Feb 27, 2012, 7:52:06 PM
to nltk...@googlegroups.com, Jan Strunk
On Tue, 28 Feb 2012 11:08:07 +1100, Jan Strunk <jan.s...@googlemail.com>
wrote:

> Could you remind us which orthographic contexts
> self._params.ortho_context[next_typ] < 46 picks out again?

46 equates to bit-mask:
(Begin-Upper & Mid-Upper & Unk-Upper & Beg-Lower & Mid-Lower & !Unk-Lower)

so >= 46 means the type:
- occurred lowercase in an unknown sentence position; or
- occurred uppercase at (Beg, Mid, Unk) and lowercase at (Beg, Mid)

and < 46 is the negation thereof...

Amber, are you sure it shouldn't be <= 46? I.e., not (context &
_ORTHO_UNK_LC)?

The previous constraint was: the type never occurred lowercase.

- Joel

L. Amber Wilcox-O'Hearn

Mar 1, 2012, 11:59:50 AM
to nltk...@googlegroups.com
On Mon, Feb 27, 2012 at 5:52 PM, Joel Nothman
<jnot...@student.usyd.edu.au> wrote:
> On Tue, 28 Feb 2012 11:08:07 +1100, Jan Strunk <jan.s...@googlemail.com>
> wrote:
>
>> Could you remind us which orthographic contexts
>> self._params.ortho_context[next_typ] < 46 picks out again?
>
>
> 46 equates to bit-mask:
> (Begin-Upper & Mid-Upper & Unk-Upper & Beg-Lower & Mid-Lower & !Unk-Lower)
>
> so >= 46 means the type:
> - occurred lowercase in an unknown sentence position; or
> - occurred uppercase at (Beg, Mid, Unk) and lowercase at (Beg, Mid)
>
> and < 46 is the negation thereof...
>
> Amber, are you sure it shouldn't be <= 46? I.e., not (context &
> _ORTHO_UNK_LC)?

You're absolutely right, Joel. It should be <=. Thank you!

I didn't test every boundary, and this one is significantly better.
Both with the original line, and with strictly <, strings like "Samuel
A. Ward" are broken after the initial, but <= 46 fixes that. That
affects a vast number of lines.

Amber

--
http://scholar.google.com/citations?user=15gGywMAAAAJ

Darren Govoni

Mar 1, 2012, 12:03:58 PM
to nltk...@googlegroups.com
Hi,
I'm interested in testing the new sentence segmentation stuff with my
own data. How can I get it and try it?

thanks!!
Darren

Joel Nothman

Mar 2, 2012, 1:54:36 AM
to nltk...@googlegroups.com, L. Amber Wilcox-O'Hearn
On Fri, 02 Mar 2012 03:59:50 +1100, L. Amber Wilcox-O'Hearn
<amber.wil...@gmail.com> wrote:

> On Mon, Feb 27, 2012 at 5:52 PM, Joel Nothman
> <jnot...@student.usyd.edu.au> wrote:

>> Amber, are you sure it shouldn't be <= 46? I.e., not (context &
>> _ORTHO_UNK_LC)?
>
> You're absolutely right, Joel. It should be <=. Thank you!

Great! But:

* Using 46 like that in your code is a terrible idea for maintainability.
And <= is not a meaningful operation over a bitmask unless the fields were
designed to be used that way.
* I have realised that I interpreted it incorrectly:

46 == 2 + 4 + 8 + 32.

So you're disallowing (16 and 32 together) and (64). Or: not (((context &
_ORTHO_BEG_LC) and (context & _ORTHO_MID_LC)) or (context & _ORTHO_UNK_LC)).

Is that what you mean? The next type can't have been seen lowercase in
training BOTH at the beginning of a sentence and in the middle, and also
can't have been seen lowercase in unknown position?

If so, please don't write it as <= 46, but like the above.


To recap:

The original considered an initial if the next token is capitalised and
its type was NEVER seen lowercase.

You accept an initial if the next token is capitalised and its type was
NEVER seen lowercase, OR was only seen lowercase at the beginning of a
sentence, OR was only seen lowercase in the middle of a sentence (but not
both).

This still seems a strange criterion. Perhaps what you mean is:

You accept an initial if the next token is capitalised and its type was
NEVER seen lowercase, OR was only seen lowercase in the middle of a
sentence.

If so, you could write this as:
not (i & (_ORTHO_BEG_LC | _ORTHO_UNK_LC))
or:
(~i & _ORTHO_BEG_LC) and (~i & _ORTHO_UNK_LC)

Or perhaps what you mean is:

You accept an initial if the next token is capitalised and its type was
NEVER seen lowercase in an unknown position (which may or may not have
been the start of a sentence):

~i & _ORTHO_UNK_LC

Please work out what you mean, or try these various combinations and see
what works!

But please remember that the algorithm is designed to be robust to
different training corpora. In many English text domains, 'I.' alone will
be an ambiguous initial at the end of a sentence. Exceptions occur when
using pronumerals, or anonymised names. And lowercase sentence-initial
words are fairly uncommon too.

- Joel

Jan Strunk

Mar 5, 2012, 7:53:29 PM
to nltk...@googlegroups.com
Thank you, Joel, for working out these details about the bit mask.
I was a little bit confused about what condition exactly was specified
with <= 46. I agree that testing bit masks should best be done
using bitwise and logical operators, ideally using the defined
constants from punkt.py.

The improvements in error rate on her data that Amber reported
are, however, very interesting. It would be good to know
how they come about so that the heuristics can perhaps
be improved.

I think that the original orthographic heuristic is probably
a little bit too cautious with detecting initials because
one and the same letter may be used both as an initial,
as a normal word (English "I"), and perhaps as a lowercase
variable, and in a "numbered" list (a. ... b. ... c. ...),
even within one and the same text. Therefore, the restriction
to only consider an (uppercase) single letter (followed by a period)
an initial if the same single letter is never used lowercase
in the same corpus, is usually too strict.

The problem is that there is usually not a lot of orthographic evidence
in the context following an initial because initials are usually
followed by names that occur only in uppercase form or by another
initial. Since names occur with initial uppercase letter at the
beginning and in the middle of sentences, there is really not a
lot of evidence in a hypothetical case like the following:

"There I met Charles I. Smith was accompanying me."

We tried to get a little more usable evidence in such cases
in the Punkt system by looking for frequent collocations spanning
a period, such as "I. Smith", and then ruling out sentence boundaries
within such collocations.

We could, however, try to come up with a modified orthographic
heuristic that is less cautious and assumes that there is no
sentence boundary following a single (uppercase) letter with a period
unless there is good evidence to the contrary -- for example, a word
that is normally written with a lowercase first letter but is now
capitalized in the right context.
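
A rough sketch of what such a relaxed rule could look like. It assumes the flag values 2 through 64 discussed in this thread for the orthographic context bit mask. The function name and the way "normally lowercase" is approximated are assumptions for illustration only, not anything that exists in punkt.py:

```python
# Flag values as discussed in this thread (assumed to mirror punkt.py).
ORTHO_BEG_UC, ORTHO_MID_UC, ORTHO_UNK_UC = 2, 4, 8
ORTHO_BEG_LC, ORTHO_MID_LC, ORTHO_UNK_LC = 16, 32, 64


def break_after_initial(next_is_capitalised, next_context):
    """Hypothetical relaxed rule: keep 'X. Name' together by default.

    A break is proposed only when the following word is capitalised here
    but was seen lowercase mid-sentence in training and never uppercase
    mid-sentence, i.e. it looks like an ordinary word starting a sentence.
    """
    normally_lowercase = bool(next_context & ORTHO_MID_LC) and not (
        next_context & ORTHO_MID_UC
    )
    return next_is_capitalised and normally_lowercase


# A name seen only capitalised gives no evidence for a break.
print(break_after_initial(True, ORTHO_BEG_UC | ORTHO_MID_UC))  # False
# An ordinary word seen lowercase mid-sentence does.
print(break_after_initial(True, ORTHO_BEG_UC | ORTHO_MID_LC))  # True
```

Whether this approximation of "normally lowercase" is the right one would still have to be tested on several corpora.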

Best regards,

Jan
str...@linguistics.rub.de

L. Amber Wilcox-O'Hearn

Mar 5, 2012, 8:13:39 PM
to nltk...@googlegroups.com
Hi, Joel.

On Thu, Mar 1, 2012 at 11:54 PM, Joel Nothman
<jnot...@student.usyd.edu.au> wrote:
> On Fri, 02 Mar 2012 03:59:50 +1100, L. Amber Wilcox-O'Hearn
> <amber.wil...@gmail.com> wrote:
>
>> On Mon, Feb 27, 2012 at 5:52 PM, Joel Nothman
>> <jnot...@student.usyd.edu.au> wrote:
>
>
>>> Amber, are you sure it shouldn't be <= 46? I.e., not (context &
>>> _ORTHO_UNK_LC)?
>>
>>
>> You're absolutely right, Joel.  It should be <=.  Thank you!
>
>
> Great! But:
>
> * Using 46 like that in your code is a terrible idea for maintainability.
> And <= is not a meaningful operation over a bitmask unless the fields were
> designed to be used that way.

Oh yes, of course. It is just a hack, and I would never write it that
way for maintainability. Part of the reason I just left it like that
is that, as you point out below, it might not even work with other
corpora. To really be incorporated into nltk, it would probably need
to be used as some kind of option, depending on the characteristics of
the corpus. I didn't want to change the real code base without access
to testing what already works, so I just made a simple, ugly,
one-liner.

> * I have realised that I interpreted it incorrectly:
>
> 46 == 2 + 4 + 8 + 32.
>
> So you're disallowing (16 & 32) and (64). Or: not ((context & _ORTHO_BEG_LC
> & _ORTHO_MID_LC) or (context & _ORTHO_UNK_LC)).
>
> Is that what you mean? The next type can't have been seen lowercase in
> training BOTH at the beginning of a sentence and in the middle, and also
> can't have been seen lowercase in unknown position?
>
> If so, please don't write it as <= 46, but like the above.

That would definitely be clearer. As you demonstrate, the bitmask is
easy to misinterpret. Even if it's just a little hack in my own code,
it's better for it to be readable. It also makes it more obvious how
arbitrary the condition I'm using is.

> To recap:
>
> The original considered an initial if the next token is capitalised and its
> type was NEVER seen lowercase.
>
> You accept an initial if the next token is capitalised and its type was
> NEVER seen lowercase, OR was only seen lowercase at the beginning of a
> sentence, OR was only seen lowercase in the middle of a sentence (but not
> both).
>
> This still seems a strange criterion. Perhaps what you mean is:
>
> You accept an initial if the next token is capitalised and its type was
> NEVER seen lowercase, OR was only seen lowercase in the middle of a
> sentence.
>
> If so, you could write this as:
> not (i & (_ORTHO_BEG_LC | _ORTHO_UNK_LC))
> or:
> ~i & _ORTHO_BEG_LC & _ORTHO_UNK_LC

(Not sure those are equivalent, but I take your point.)
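
A quick way to check is to enumerate every possible context value. The sketch below assumes the flag values implied by 46 == 2 + 4 + 8 + 32 (BEG_LC = 16, MID_LC = 32, UNK_LC = 64); the helper names are made up for illustration:

```python
ORTHO_BEG_LC, ORTHO_MID_LC, ORTHO_UNK_LC = 16, 32, 64

# Every context mask uses some subset of bits 1..6, i.e. an even number 0..126.
all_contexts = range(0, 128, 2)


def form_a(i):
    # not (i & (_ORTHO_BEG_LC | _ORTHO_UNK_LC))
    return not (i & (ORTHO_BEG_LC | ORTHO_UNK_LC))


def form_b(i):
    # ~i & _ORTHO_BEG_LC & _ORTHO_UNK_LC
    return bool(~i & ORTHO_BEG_LC & ORTHO_UNK_LC)


differing = [i for i in all_contexts if form_a(i) != form_b(i)]
print(len(differing))                        # 16 masks where the two forms disagree
print(any(form_b(i) for i in all_contexts))  # False: 16 & 64 == 0, so form_b never holds
```

Two distinct single-bit flags ANDed together are always zero, so the second form can never be true. Spelled out, the first form is `not (i & _ORTHO_BEG_LC) and not (i & _ORTHO_UNK_LC)`.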

> Or perhaps what you mean is:
>
> You accept an initial if the next token is capitalised and its type was
> NEVER seen lowercase in an unknown position (which may or may not have been
> the start of a sentence):
>
> ~i & _ORTHO_BEG_UNK_LC
>
> Please work out what you mean, or try these various combinations and see
> what works!

Yes, you're right. To get the best results, I should be systematic.
Instead I just manually looked at the values of the bitmask of some
false positives, false negatives, true positives, and true negatives,
developed a hunch, and verified that the results improved. So in that
sense, I didn't really "mean" anything, because it wasn't based on an
intuition about the case characteristics of the token. I had started
to look at it that way, but honestly, all the "nots" had started to
turn my head in circles.

Let me try to state it in the positive: As it stands, using <=46, I
will break the sentence if the uppercase token following the initial
has been seen lowercase in an unknown position, and I will break the
sentence if the token following the initial has been seen lowercase at
both the beginning and middle. But if I've never seen it lowercase,
or I've seen it lowercase only at what was certainly the beginning
or certainly the middle, then I will not break the sentence.
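
That positive statement can be checked against the `<= 46` test by brute force. This is only a sketch: it assumes the lowercase flags are 16 (beginning), 32 (middle) and 64 (unknown), and the function names are invented here:

```python
ORTHO_BEG_LC, ORTHO_MID_LC, ORTHO_UNK_LC = 16, 32, 64


def hack(ctx):
    # The one-liner: treat as "do not break" when the mask is at most 46.
    return ctx <= 46


def explicit(ctx):
    seen_lc_unknown = bool(ctx & ORTHO_UNK_LC)
    seen_lc_beg_and_mid = bool(ctx & ORTHO_BEG_LC) and bool(ctx & ORTHO_MID_LC)
    # "Do not break" unless one of the two disqualifying patterns holds.
    return not (seen_lc_unknown or seen_lc_beg_and_mid)


# Masks only ever use bits 1..6, so the even numbers 0..126 cover every case.
mismatches = [ctx for ctx in range(0, 128, 2) if hack(ctx) != explicit(ctx)]
print(mismatches)  # []
```

Note that the "both beginning and middle" test needs a logical `and` between two separate flag checks; chaining the two flags with `&` does not express it.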

Looking at this in a brute force fashion, there would be a lot of
combinations to try. There are 3 different conditions: whether or not
it's been seen lowercase in the beginning, middle, or unknown
position. There are only 8 different combinations of values, but
there would be 2^8 ways to break or not break based on the 8 values,
and I would have to compare them manually or else annotate a corpus
for training.
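
The counting behind those numbers, as a small sketch (the condition labels are just descriptive strings):

```python
from itertools import product

# Three yes/no observations about the next word type's lowercase occurrences.
conditions = ("seen lowercase at BEG", "seen lowercase at MID", "seen lowercase at UNK")

combinations = list(product((False, True), repeat=len(conditions)))
print(len(combinations))       # 8 possible observation patterns

# A decision rule assigns break / no-break to each pattern independently.
print(2 ** len(combinations))  # 256 candidate rules
```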

Of course, most of those would be somewhat bizarre, like the one I
have, and probably not worth testing. The combinations you suggest
seem worth trying. I'll get back to you. I really appreciate your
help!

> But please remember that the algorithm is designed to be robust to different
> training corpora. In many English text domains, 'I.' alone will be an
> ambiguous initial at the end of a sentence. Exceptions occur when using
> pronumerals, or anonymised names. And lowercase sentence-initial words are
> fairly uncommon too.
>
> - Joel


-Amber

L. Amber Wilcox-O'Hearn

Mar 8, 2012, 12:53:05 PM
to nltk...@googlegroups.com
Joel, I need a little more guidance.

I started by trying to verify that re-writing as above would not
change the results I have so far.

You had said:

> So you're disallowing (16 & 32) and (64). Or: not ((context & _ORTHO_BEG_LC
> & _ORTHO_MID_LC) or (context & _ORTHO_UNK_LC)).

So I changed the line

    if ( is_sent_starter == 'unknown' and tok_is_initial and
         aug_tok2.first_upper and
         self._params.ortho_context[next_typ] <= 46 or aug_tok2.is_initial ):

to

    if ( is_sent_starter == 'unknown' and tok_is_initial and
         aug_tok2.first_upper and
         not ((self._params.ortho_context[next_typ] & _ORTHO_BEG_LC &
               _ORTHO_MID_LC) or
              (self._params.ortho_context[next_typ] & _ORTHO_UNK_LC))
         or aug_tok2.is_initial ):

Is that what you intended?

The result is not exactly the same: 6 lines are different. These 6
had been segmented correctly, but with the new code, 5 are now falsely
unbroken, and 1 falsely broken. That result itself is not a big deal,
out of 100,000 lines, but it shows I'm not doing what I think I'm
doing.

An example where the first code split, but the second code didn't is:

Carbon sublimes in a carbon arc which has a temperature of about 5800
K. Thus, irrespective of its allotropic form, carbon remains solid at
higher temperatures than the highest melting point metals such as
tungsten or rhenium.

I'm not sure I understand the difference.

Amber

Joel Nothman

Mar 8, 2012, 7:26:05 PM
to amber.wil...@gmail.com, nltk-dev

Hi Amber,

Sorry I have not yet replied to either of your most recent emails. I am
busy working on a paper.

I have not given something equivalent, and it is possible that my
assumptions about what might work are incorrect. Even so, 6 in 100,000 may
be insignificant, and you would do well to check, using multiple corpora,
whether this model really is worse than the other.

To understand what's going on, you need to get the ortho_context for each
type where the decision changed, i.e. 'thus' in the example below, and
decode it into the context flags from training:

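import itertools

def debug_ortho(val):
    # Bits 1-6 of val map, in order, to U-BEG, U-MID, U-UNK, L-BEG, L-MID, L-UNK.
    contexts = itertools.product('UL', ('BEG', 'MID', 'UNK'))
    return [tup for i, tup in enumerate(contexts) if val & 1 << (i + 1)]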

so, e.g.

>>> debug_ortho(46)
[('U', 'BEG'), ('U', 'MID'), ('U', 'UNK'), ('L', 'MID')]
>>> debug_ortho(~46)
[('L', 'BEG'), ('L', 'UNK')]

>>> debug_ortho(tokenizer._params.ortho_context['thus'])
??? you tell me...

~J

L. Amber Wilcox-O'Hearn

Mar 17, 2012, 12:54:30 AM
to nltk-dev
On Thu, Mar 8, 2012 at 5:26 PM, Joel Nothman
<jnot...@student.usyd.edu.au> wrote:
>
> Hi Amber,
>
> Sorry I have not yet replied to either of your most recent emails. I am busy
> working on a paper.

No problem. I appreciate the time you've spent already. In fact, I
can't spend much more time perfecting this now either, as I need to
proceed to downstream work.

> I have not given something equivalent, and it is possible that my
> assumptions about what might work are incorrect. Even so, 6 in 100,000 may
> be insignificant, and you would do well to check, using multiple corpora,
> whether this model really is worse than the other.

Ok. I agree it's negligible, and another set of test sentences could
show the reverse. I was just wanting to make sure I understood what
the suggestion was.

> To understand what's going on, you need to get the ortho_context for each
> type where the decision changed, i.e. 'thus' in the example below, and
> decode it into the context flags from training:
>
> def debug_ortho(val):
>   return [tup for i, tup in enumerate(itertools.product('UL', ('BEG', 'MID',
>     'UNK'))) if val & 1 << (i + 1)]
>
> so, e.g.
>
> >>> debug_ortho(46)
> [('U', 'BEG'), ('U', 'MID'), ('U', 'UNK'), ('L', 'MID')]
> >>> debug_ortho(~46)
> [('L', 'BEG'), ('L', 'UNK')]
>
> >>> debug_ortho(tokenizer._params.ortho_context['thus'])
> ??? you tell me...

Well, perhaps not surprisingly, in this case it's [('U', 'BEG'), ('U',
'MID'), ('U', 'UNK'), ('L', 'BEG'), ('L', 'MID')].
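
That value seems to explain the difference. A small check, assuming the flag values 16, 32 and 64 for the three lowercase contexts (the variable names are only for illustration):

```python
ORTHO_BEG_LC, ORTHO_MID_LC, ORTHO_UNK_LC = 16, 32, 64

# U-BEG + U-MID + U-UNK + L-BEG + L-MID, as reported for 'thus'.
thus = 2 + 4 + 8 + 16 + 32

old_accepts = thus <= 46
# The rewritten test: the chained & combines two different bits and is always 0.
new_accepts = not ((thus & ORTHO_BEG_LC & ORTHO_MID_LC) or (thus & ORTHO_UNK_LC))
# What was probably intended: both flags checked individually.
intended_accepts = not (
    ((thus & ORTHO_BEG_LC) and (thus & ORTHO_MID_LC)) or (thus & ORTHO_UNK_LC)
)

print(thus, old_accepts, new_accepts, intended_accepts)  # 62 False True False
```

With the old test the "K." is not accepted as an initial, so the sentence is split before "Thus"; with the rewritten test it is accepted and the split disappears.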

I may take this most recent iteration as good enough for now. It's
logically cleaner than what I had, and not significantly different.

Amber

Darren Govoni

Mar 23, 2012, 8:34:25 AM
to nltk...@googlegroups.com
Hi,
Is there a way we nltk users can try the latest sentence segmentation
routines and see how they perform?

thanks.

xinfan meng

Mar 23, 2012, 8:36:48 AM
to nltk...@googlegroups.com
I think you can clone the source code and run the routines.

--
Best Wishes
--------------------------------------------
Meng Xinfan(蒙新泛)
Institute of Computational Linguistics
Department of Computer Science & Technology
School of Electronic Engineering & Computer Science
Peking University
Beijing, 100871
China

Steven Bird

Mar 23, 2012, 5:25:17 PM
to nltk...@googlegroups.com
None of this discussion has (yet) resulted in changes to the official
version of punkt.py.

If there are modifications that anyone wants to share, please just post
them on the issue tracker -- i.e. open a new issue and attach your
version.

Thanks, -Steven

Darren Govoni

Mar 23, 2012, 7:32:00 PM
to nltk...@googlegroups.com
Ok. Thanks for the heads up. I'm happy to compare new algorithms for
this when they become available. I have tons of data where the
segmentation is a little off in the current code.

Joel Nothman

Mar 24, 2012, 5:30:45 AM
to nltk...@googlegroups.com, Darren Govoni

Hi Darren,

Amber's suggestions only modify the application of orthographic rules
following an initial. Does this constitute a substantial proportion of
your errors?

If so, perhaps we should make different orthographic heuristic modes for
more lenient situations. In any case there are various parameters that can
be modified, but only one setting has been selected by the Punkt authors
for cross-corpus performance, and we do not yet know anything about how
well proposed changes generalise.

Punkt was tuned for numerous corpora in a variety of languages and text
capitalisations. If you're not in need of Punkt's domain and language
flexibility, and just want an English-language corpus segmented, why not
use a high-performance supervised system like
http://code.google.com/p/splitta/?

Cheers,

- Joel

Darren Govoni

Mar 24, 2012, 6:40:07 AM
to Joel Nothman, nltk...@googlegroups.com
Hi Joel,
Thanks for the suggestion. A quick run of the sbd code produces a good
many more valid sentences from my data (where the sentences are not
necessarily cleanly represented).

The multilingual aspects of Punkt, however, are truly useful, and I expect
that a combination of the two will be a good strategy.

Are the other parameters available for Punkt documented? I will try them
out as well.

Best,
Darren

Joel Nothman

Mar 24, 2012, 10:06:33 AM
to Darren Govoni, nltk...@googlegroups.com
Okay. I've done enough individual Q&A on this.

I've just introduced PunktSentenceTokenizer.debug_decisions in a patch at
https://github.com/jnothman/nltk/blob/master/nltk/tokenize/punkt.py

Given text, the method generates a dictionary giving data on each sentence
boundary decision.

For example:

>>> import nltk.corpus
>>> from nltk.tokenize import punkt
>>> text = nltk.corpus.gutenberg.raw('austen-emma.txt')
>>> trainer = punkt.PunktTrainer(text)
>>> tokenizer = punkt.PunktSentenceTokenizer(trainer.get_params())
>>> decisions = tokenizer.debug_decisions(text)
>>> print punkt.format_debug_decision(decisions.next())
Text: 'her.\n\nShe' (at offset 288)
Sentence break? True (default decision)
Collocation? False
'her.':
    known abbreviation: False
    is initial: False
'she':
    known sentence starter: True
    orthographic heuristic suggests is a sentence starter? unknown
    orthographic contexts in training: set(['BEG-UC', 'UNK-UC', 'UNK-LC',
    'MID-LC', 'BEG-LC', 'MID-UC'])

>>> print punkt.format_debug_decision(decisions.next())
Text: 'period. Her' (at offset 476)
Sentence break? True (default decision)
Collocation? False
'period.':
    known abbreviation: False
    is initial: False
'her':
    known sentence starter: False
    orthographic heuristic suggests is a sentence starter? unknown
    orthographic contexts in training: set(['UNK-UC', 'UNK-LC', 'BEG-UC',
    'MID-LC', 'MID-UC'])


If, for example, the abbreviations seem too conservative, try modifying
the training parameters (ABBREV, IGNORE_ABBREV_PENALTY, ABBREV_BACKOFF),
each documented in the source. If collocations are a problem, see
COLLOCATION, INCLUDE_ALL_COLLOCS, INCLUDE_ABBREV_COLLOCS, MIN_COLLOC_FREQ.
And so on. Or if you have the time/care for it, do an incremental
parameter search over a portion where you have marked gold standard
sentence breaks...

Or you can more directly inspect the model: look at
trainer._params.abbrev_types, etc. Adjust the training parameters as
necessary, or artificially add cases as you find them.
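
As a small illustration of adding cases by hand, here is a sketch that builds a parameter object directly instead of training one. It assumes the `PunktParameters` class and its `abbrev_types` set as found in nltk/tokenize/punkt.py; the abbreviations and the sample text are made up:

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

params = PunktParameters()
# Abbreviation types are stored lowercased and without the final period.
params.abbrev_types.update(['mr', 'u.s'])

# Passing a parameter object (rather than raw text) skips training.
tokenizer = PunktSentenceTokenizer(params)
sentences = tokenizer.tokenize('Mr. Smith visited the U.S. last year. He liked it.')
for sentence in sentences:
    print(sentence)
```

The same `abbrev_types` set is available on the parameters of a trained model, so cases can be added to or removed from it after training in the same way.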

Good luck!

- Joel
