Doing really, really good sentence boundary detection is an on-going
problem in natural language processing. I'm not aware of any Ruby-
based NLP packages, but if you want better accuracy than just using
[.!?:] there are several free NLP packages around (NLTK in Python,
and Stanford's Java NLP package spring to mind) that might help you.
A googling of "sentence tokenization" may also yield some help.
If that sounds like overkill, then you can get accuracy "good enough
for government work" by making a list of regular expressions to catch
exceptions to the punctuation rule. These will necessarily vary a
little depending on your source text, but typical examples are
catching titles like "Mr.", "Mrs.", "Dr.", and all-caps abbreviations
like "U.S.A." or "M.D." (something like this: /([A-Z]\.([A-Z]\.)+)/).
good luck,
matthew smillie.
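A minimal Ruby sketch of the exception-list idea described above. The constant names, the particular titles listed, and the split_sentences helper are all assumptions made for illustration; a real list would be tuned to the source text.

```ruby
# Hypothetical exception patterns; extend them for your own source text.
TITLE  = /\b(?:Mr|Mrs|Ms|Dr)\.\z/        # titles such as "Mr." or "Dr."
ABBREV = /\b[A-Z]\.(?:[A-Z]\.)+\z/       # all-caps abbreviations such as "U.S.A."
EXCEPTIONS = [TITLE, ABBREV].freeze

# Split on [.!?] followed by whitespace, unless the text so far ends in
# one of the exception patterns.
def split_sentences(text)
  sentences = []
  pending = []
  # Each chunk runs up to the next candidate punctuation plus trailing space.
  text.scan(/.*?[.!?]+(?:\s+|\z)|.+\z/m) do |chunk|
    pending << chunk
    candidate = pending.join.rstrip
    # An exception at the end of the candidate means "not a boundary yet".
    next if EXCEPTIONS.any? { |re| candidate =~ re }
    sentences << candidate
    pending = []
  end
  sentences << pending.join.rstrip unless pending.empty?
  sentences
end

p split_sentences("Dr. Smith lives in the U.S.A. with Mrs. Jones. He is happy!")
```

One known weakness of this sketch: an abbreviation that genuinely ends a sentence is treated as an exception too, so that sentence gets glued to the next one.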
----
Matthew Smillie <M.B.S...@sms.ed.ac.uk>
Institute for Communicating and Collaborative Systems
University of Edinburgh
_Kevin
Nick
--
Nicholas Van Weerdenburg
It's a common convention to separate sentences by double spaces. I
started following this convention because Emacs expected it, and now I
use it always.
I too learned the two space after a period convention years ago and
also recently learned that with modern fonts and word processors it is
not necessary. It was tricky to retrain myself, but I did, and have
been using just one space ever since.
So like you say, that isn't a reliable way to discern sentences.
I would recommend following the advice of first filtering out false
positives (possibly even replacing them with temporary markers, Mr.
becomes $MISTER$ or similar), then splitting on punctuation. If you
then test on various sample texts you should be able to find more
false positives that you might have missed.
Ryan
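The mask-then-split idea above could look roughly like this in Ruby. The PLACEHOLDERS table and the method name are hypothetical, and only the $MISTER$ marker comes from the post itself.

```ruby
# Hypothetical placeholder table: known false positives and their markers.
PLACEHOLDERS = {
  'Mr.'  => '$MISTER$',
  'Mrs.' => '$MISSUS$',
  'Dr.'  => '$DOCTOR$'
}.freeze

def split_with_placeholders(text)
  # Step 1: hide the known false positives behind temporary markers.
  masked = PLACEHOLDERS.reduce(text) { |s, (abbr, marker)| s.gsub(abbr, marker) }
  # Step 2: split after sentence punctuation that is followed by whitespace.
  parts = masked.split(/(?<=[.!?])\s+/)
  # Step 3: put the original abbreviations back into each sentence.
  parts.map do |part|
    PLACEHOLDERS.reduce(part) { |s, (abbr, marker)| s.gsub(marker, abbr) }
  end
end

p split_with_placeholders("Mr. Smith met Dr. Jones. They talked! Did it help?")
```

Running this over sample texts and inspecting the pieces is one way to spot missed false positives. A masked abbreviation that really does end a sentence will not be split by this sketch.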
If you make a regexp: [.!?]\s+[A-Z] you will already capture most. Most
abbreviations normally aren't followed by a space and a capital letter.
One exception to this rule that I can think of is Mr. Name, Mrs. Name. But
as you can see these have an <uppercase> followed by only one or two
lowercase letters. Most sentences would have at least five non-uppercase
letters in front of the <.> ->
[A-Z]\w\w?\w?\w?\.
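A rough Ruby sketch of both patterns from this post. The lookarounds, the added \b anchor, and the method names are assumptions made for illustration, not part of the original regexps.

```ruby
# Boundary: sentence punctuation, whitespace, then an uppercase letter.
# Lookarounds keep the punctuation and the capital letter in the output.
BOUNDARY = /(?<=[.!?])\s+(?=[A-Z])/

# The post's [A-Z]\w\w?\w?\w?\. pattern, anchored to the end of a piece:
# a capital, one to four more word characters, then a period.
SHORT_CAP = /\b[A-Z]\w{1,4}\.\z/

def split_on_capitals(text)
  text.split(BOUNDARY)
end

# Rejoin any piece that ends in a short capitalized word such as "Mr."
# with the piece that follows it (joined with a single space).
def split_skipping_short_caps(text)
  merged = []
  split_on_capitals(text).each do |piece|
    if merged.any? && merged.last =~ SHORT_CAP
      merged[-1] = "#{merged.last} #{piece}"
    else
      merged << piece
    end
  end
  merged
end

text = "It costs approx. ten dollars. That is cheap. Mr. Smith agrees."
p split_on_capitals(text)
p split_skipping_short_caps(text)
```

The first call breaks after "Mr."; the second repairs that. The price is that a real sentence ending in a short capitalized word ("I met Bob. He left.") would be merged with its neighbour.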
That, in fact, is a very *bad* metric to follow, as the proper spacing
after sentence punctuation is a single space. The only reason that two
spaces were used in the past is that the space used between sentence
endings in typeset work is a little wider than that used between words
(an em-space vs. an en-space).
-austin
--
Austin Ziegler * halos...@gmail.com
* Alternate: aus...@halostatue.ca
Look at Text::Format for some indication on how abbreviations could be handled.
As I noted above, this is an improper convention outside of the
typewriter realm. If you are using anything other than a fixed-pitch
font for display or print, you should *never* use two spaces.
Alternatively, use text processing systems that do the "right thing";
i.e. transform two spaces into one (e.g. TeX, HTML-based products).
There is no good reason a text processor should show two spaces after
each other in print.
> -austin
--
Christian Neukirchen <chneuk...@gmail.com> http://chneukirchen.org
Not true at all. I was always taught to use double spaces after
sentences in grade-school homework assignments done on plain word
processors or typewriters.
Many of us were and I'll admit that I can't shake the habit. I still
know it's wrong though. ;)
James Edward Gray II
Then, quite honestly, you were taught wrong. I was taught to use
double spaces with a typewriter or when using fixed-pitch fonts
(although that was later, since most computers and printers didn't
have reliable kerning routines until I was out of university).
Ultimately, the use of double spaces after a period is wrong *even
with fixed-pitch fonts*, but it was done to be clearer since the width
of the em-space and an en-space on a typewriter with a Courier-like
font is exactly the same. The two spaces *simulate* an em-space in a
typeset piece of work. (And that is *fact*, not opinion.)
Here is a great treatment on the topic,
http://www.webword.com/reports/period.html
The utility of this method for determining the end of a sentence depends
entirely on the purpose of the program. If I were to write a routine to
parse text that I wrote, it would probably work pretty well, and it would
save me several hours of work trying to implement a fancier, more robust
routine.
The same routine would probably fail horribly for other users or a more
generic corpus of text.
As a general rule, I like to use algorithms that are as simple as possible
for the job. That, of course, depends a lot on what the job is.
Funny, I never thought something like spacing between sentences would be so
controversial. I can almost envision _why making an esoteric remark about
the beauty of 'negative space' in text files.
_Kevin
-----Original Message-----
From: Austin Ziegler [mailto:halos...@gmail.com]
Sent: Wednesday, November 30, 2005 12:40 PM
To: ruby-talk ML
Subject: Re: Splitting a text file into sentences
The Bedford Handbook, which has been my bible for writing conventions
through the past ten years, lists two sets of guidelines: Those
recommended by the Modern Language Association (MLA), and those
recommended by the American Psychological Association (APA). It says
that the MLA style is typically taught in English classes, but that the
APA style is common in the social sciences. Here is the explanation of
the MLA guidelines, from page 633 of the Bedford Handbook for Writers,
(c) 1994:
MLA Guidelines [for essays]:
In typing the text of the essay, leave one space after words, commas,
colons, and semicolons and between the dots in ellipsis marks. Leave
two spaces after periods, question marks, and exclamation points.
To form a dash, type two hyphens with no space between them. Do not
put a space on either side of a dash.
The Handbook goes on to say (p. 635):
Although the APA guidelines call for one space after all punctuation,
most college professors prefer two spaces at the end of a sentence. Use
one space after all other punctuation.
Although two spaces are used after a period that ends a sentence, use
only one space after a period that follows a person's initial (B.F.
Skinner).
To form a dash, type two hyphens with no space between them. Do not
put a space on either side of a dash.
The Handbook itself uses only single spaces at the ends of sentences.
Still, I hardly think there is one conclusively "right" or "wrong"
convention. Until I am convinced otherwise, I will continue to use two
spaces to separate sentences. This makes sentences easier to lex with
regular expressions, and makes them stand out to text editors and human
readers.
"Right" or "wrong" in this kind of styling has to do with whether
something is right or wrong according to a particular convention.
The normal convention for professional typography is to use one space
between sentences, whether you are convinced or not, whether using hard
type, a professional typesetting program, a desktop publishing program,
or a word processing program.
The older typewriter conventions are still often requested for
manuscripts for academic essays and manuscripts for submission to
publishing houses. These conventions also require underlining rather
than italics, use of double-hyphen for a dash rather than the specific
dash character, and so forth. But should this same manuscript be
professionally printed, even if the text is actually to be set by a
word processor, it would almost certainly be edited first to convert it
to typographical standard: changing all double-spaces to single spaces,
all occurrences of double-hyphen to em-dash or en-dash, using fancy
quotation marks instead of possible straight typewriter quotation
marks, italics instead of underlining, and so forth.
Note that HTML has from the beginning automatically changed any
multiple runs of spaces into a single space when displaying text.
Yes, a convention of always using two spaces would make sentences
easier to lex with regular expressions. Similarly, enforcing one single
spelling of English throughout the world would make searches and
matches easier. However, it is philosophically unsound to ask that the
world change to fit particular data-processing routines, rather than
that data-processing routines be built to deal properly with
real-world situations.
If your lexing routine fails because many people don't end
non-paragraph-final sentences with double-spaces, or do so only in
particular plain text files, it is the fault of your lexing routine for
failing to handle common formatting, unless your lexing is intended
to be a limited tool that works only with manuscript formatted text.
The best general sentence lexing algorithm I've seen is the one set
forth by the Unicode Consortium at
http://www.unicode.org/reports/tr29/tr29-4.html#Sentence_Boundaries .
This is designed to work reasonably well in any language and writing
system supported by Unicode, not just in English.
Jallan
For improved legibility, inter-sentence space should generally be a bit
greater than inter-word space.
Typewriters only had one distance they could travel. Either 1/10th of
an inch ("Pica") or 1/12th ("Elite"). So the only way to add extra
space after a sentence was to double it. That's way too much extra
space, but it was generally better than the alternative. The real
problem was that the words were too far apart, not that the sentences
were too close, but again, the fixed spacing was already an abominable
situation.
Proportional type, dating all the way back to Gutenberg, would
generally use 1/3rd or 1/4th of the height of the type as the
inter-word spacing. This would usually work out to about the width of a
lower case "t" or "l".
When setting modern (by which you may also read "all type before
typewriters" as well) proportional type in fully justified form (left
and right margins both even), the spaces must be stretched out on a
line-by-line basis to fit. Really good typesetting programs (and really
good typesetters sticking little bits of lead between their words (and
I've done that, too)) will add more of the space between sentences than
between words, so as the line stretches, the inter-word space to
inter-sentence space ratio actually changes. (Take a look at a narrow
newspaper column sometime.)
More sophisticated approaches to space will ignore a user's attempt to
sprinkle extraneous space in. Less sophisticated ones might allow it,
and even treat them as individual spaces, stretching both of them
during expansion. {shudder}
The fact that both the MLA Guidelines and the Bedford Handbook
encourage poor typography is regrettable. ("If you cannot type
appropriate punctuation, e.g. an em-dash or en-dash, please use
appropriate substitutions. For both dashes, substitute a pair of
hyphens, which, like true dashes, are typed without adjacent spaces."
There's still software out there that will happily wrap a line between
the two hyphens. Ick!) Nevertheless, if you're submitting a paper to an
institution that expects or requires that, then to not follow them is
wrong, even if the legibility of the submission is better.
What it all boils down to is "Putting two spaces after a period at the
end of a sentence is an artifact left over from the days when the
typewriter was the prevalent text-making tool. Unless you have a
specific reason or requirement to do otherwise, it's preferable to put
only one space between sentences."
*****
For breaking text into sentences, sometimes I find it easier to work
backwards. Also, only very colloquial writing will have a one-word
sentence, so you can solve all "Mr./Dr./Ph.D." cases by the fact that
if a word starts with a cap and ends with a period, it's not a
sentence. For a more sophisticated approach that's still not too
complex to program, check the final word of a sentence against a
dictionary. If it's found there without a final dot, then you're almost
certainly looking at the end of a sentence. If it isn't, then is it
found anywhere else in the document without a dot? If not, then you're
probably looking at an abbreviation. (My mail program uses a monospaced
font. If I thought most readers would read it with a proportional font,
I'd have typed "Ph. D." above, since it should have a thin space before
the D.)
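The dictionary check described above might be sketched in Ruby like this. The tiny DICTIONARY set is a hypothetical stand-in for a real word list, and the method name and return symbols are assumptions made for illustration.

```ruby
require "set"

# Hypothetical stand-in for a real dictionary; a real word list is far larger.
DICTIONARY = Set.new(%w[wrote to nobody else replied]).freeze

# Classify a dotted token such as "replied." or "Dr." using the two checks:
# 1. is the word (without its dot) in the dictionary?
# 2. if not, does it occur elsewhere in the document without a dot?
def classify_dotted_word(token, document, dictionary = DICTIONARY)
  word = token.chomp(".")
  return :likely_sentence_end if dictionary.include?(word.downcase)
  # The word standing alone: not glued to other word characters or dots.
  bare = /(?<![\w.])#{Regexp.escape(word)}(?![\w.])/
  return :likely_sentence_end if document =~ bare
  :likely_abbreviation
end

doc = "Dr. Howell wrote to Howell. Nobody else replied."
p classify_dotted_word("Dr.", doc)       # never appears without its dot
p classify_dotted_word("Howell.", doc)   # also appears as bare "Howell"
p classify_dotted_word("replied.", doc)  # found in the dictionary
```

The results are only "likely" verdicts; a short document may simply never repeat a word without its dot.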
What rot. How can anything like that be a fact? You're regurgitating
the opinion of a style manual.
Gavin
> you can solve all "Mr./Dr./Ph.D." cases by the fact that if a word
> starts with a cap and ends with a period, it's not a sentence.
I'm not sure that's a very good rule, Dave. There are two sentences
here.
The above rule may catch titular abbreviations, but over-generalises
to produce a false negative in the above example. So in solving one
problem, you introduce another one. It's relatively easy to make
another rule to catch the problem in this case, but it would probably
have been simpler to just make a specific rule to eliminate titular
abbreviations, since there really aren't that many of them.
matthew smillie.
This is what I love about Usenet. :)
Um. No, I'm stating fact. This isn't mere opinion: two spaces were done
to simulate em-spaces in fixed pitch environments. That's a fact. The
reason for that may often be forgotten, but it *remains* a fact. Please
remember that I've done quite a bit of typesetting-style work in the
last year with PDF::Writer and I have to know a bit more about this than
most folks, and it's something of a hobby of mine in any case to know
about printing mechanisms.
The only *opinion* I stated was that the first poster in the chain above
(I think Jeffrey) was taught wrongly. I maintain that as true
regardless, because if he was taught two spaces without the reason why,
then there's a practice being repeated for no good reason.
The practice is nonsense these days in most contexts.
> On Nov 30, 2005, at 22:02, Dave Howell wrote:
>
>> you can solve all "Mr./Dr./Ph.D." cases by the fact that if a word
>> starts with a cap and ends with a period, it's not a sentence.
>
> I'm not sure that's a very good rule, Dave. There are two sentences
> here.
>
> The above rule may catch titular abbreviations, but over-generalises
> to produce a false negative in the above example.
I hadn't intended to provide a single magical rule that was perfect in
isolation, after all. {chuckle}
"Ph. D." is not a sentence. But where do you break
My name is Dave, Ph. D. Pleased to meet you.
vs.
You need my Ph. D. friend Dave to help you.
I don't think having a list of abbreviations and titles will improve
that situation much, although it's a lot more work and almost certain
to be incomplete. Any/every rule will have failures; avoiding them is
what takes you into that whole natural language high-octane engine
situation.
However, if you also use the *other* "rule" I mentioned, then you don't
have a problem. "Dave Howell" appears just a couple lines earlier,
establishing "Dave" as a word that doesn't require a period. Therefore,
it's more likely to be at the end of a sentence. The following word
("There") can be found in a dictionary, and in a non-capitalized form,
which means that its capitalization here following a dot strongly
indicates that it's beginning a sentence.
The capital "P" of "Ph." is not preceded by a period either time, so
it's not starting a sentence. After it, "friend" isn't capitalized, so
it's not ending a sentence. But "Pleased" is, and dictionary says "not
normally capitalized" so that's probably a sentence break.
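Here is a small Ruby sketch of the "look at the following word" reasoning, applied to the two Ph. D. examples. LOWERCASE_WORDS is a hypothetical stand-in for a dictionary of words that are not normally capitalized, and both method names are assumptions.

```ruby
require "set"

# Hypothetical list of words that are "not normally capitalized".
LOWERCASE_WORDS = Set.new(%w[pleased there friend you need my to help meet]).freeze

# A dot is treated as a sentence break only when the next word is
# capitalized here but is normally written in lowercase.
def break_after_dot?(next_word, lowercase_words = LOWERCASE_WORDS)
  return false unless next_word =~ /\A[A-Z]/   # "friend": no capital, no break
  lowercase_words.include?(next_word.downcase) # "Pleased": capital is due to position
end

def split_by_following_word(text)
  sentences = [[]]
  tokens = text.split
  tokens.each_with_index do |tok, i|
    sentences.last << tok
    nxt = tokens[i + 1]
    next unless nxt && tok.end_with?(".")
    # Compare only the leading word characters of the next token ("D." -> "D").
    sentences << [] if break_after_dot?(nxt[/\A\w+/].to_s)
  end
  sentences.reject(&:empty?).map { |words| words.join(" ") }
end

p split_by_following_word("My name is Dave, Ph. D. Pleased to meet you.")
p split_by_following_word("You need my Ph. D. friend Dave to help you.")
```

With this word list the first text splits before "Pleased" and the second stays in one piece. A proper name after a real sentence end ("... is here. Dave left.") would be missed, which is where the other rule about words seen elsewhere without a dot would have to help.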
Dave Howell:
> For improved legibility, inter-sentence space should
> generally be a bit greater than inter-word space.
It's worth noting that actually turning this theory into reality seems
to apply to 'Western' (American, British, others?) typography (mostly?
only?).
I've yet to see a typical modern Polish book typeset with greater
inter-sentence spaces. Also (and, I guess, as a result of this),
I doubt I ever saw any Polish email or Usenet post with two
inter-sentence spaces, and I remember how happy I was to find
out about the 'joinspaces' vim option that finally let me reflow
paragraphs properly, without doing a s/  / /g on them afterwards. :o)
Cheers,
-- Shot
--
He has never been known to use a word that might send a reader
to the dictionary. -- William Faulkner on Ernest Hemingway
====================== http://shot.pl/hovercraft/ === http://shot.pl/1/125/ ===
Fair enough.
Gavin
>
> On Nov 30, 2005, at 15:35, Matthew Smillie wrote:
>
>> On Nov 30, 2005, at 22:02, Dave Howell wrote:
>>
>>> you can solve all "Mr./Dr./Ph.D." cases by the fact that if a
>>> word starts with a cap and ends with a period, it's not a sentence.
>>
>> I'm not sure that's a very good rule, Dave. There are two
>> sentences here.
>>
>> The above rule may catch titular abbreviations, but over-
>> generalises to produce a false negative in the above example.
>
> I hadn't intended to provide a single magical rule that was perfect
> in isolation, after all. {chuckle}
Didn't assume you were! It was just a good example to use for a
"this can be harder than it looks" couple of lines of warning, since
it's been my experience that people don't anticipate false negatives
as well as they do false positives.
matthew smillie.
The abbreviation for Mister is Mr.
The head office is in New York, N.Y.
In other words, abbreviations that end a sentence. These sentences
don't end with a double dot, so if we replace Mr. with $MISTER$, the
sentence has no end marker.
Hmmm.
basi
>
> On Nov 30, 2005, at 15:35, Matthew Smillie wrote:
>
>> On Nov 30, 2005, at 22:02, Dave Howell wrote:
>>
>>> you can solve all "Mr./Dr./Ph.D." cases by the fact that if a word
>>> starts with a cap and ends with a period, it's not a sentence.
>>
>> I'm not sure that's a very good rule, Dave. There are two sentences here.
>>
>> The above rule may catch titular abbreviations, but over-generalises to
>> produce a false negative in the above example.
>
> I hadn't intended to provide a single magical rule that was perfect in
> isolation, after all. {chuckle}
>
>
Want some magick? You are stuck in wrong coordinate
system, like Newton. Stop thinking in terms of words and syntax
rules governing how to put them in correct order. Think
links (alinka). Think relations and revelation.
Words (symbols) have no meaning. None. They *are* empty.
If you want to infiltrate enemy organization the most
effective method is not drilling into individual agents,
but monitoring their communications (that is, relations).
If you acquire enough of those relations (and recursively,
but set some boundary unless you are Goddess and can
do anything you fancy) you don't even need to decrypt the
messages, unless you are bored. To destroy enemy
organization, mess with the relations. Agents (symbols,
words, punctuation marks...) are of no importance
whatsowherever. That is why a person, if immersed enough
in an alien language needs no dictionary day-to-day - if one does
need to check, it's not the meaning you are after -
it's definition, that is MORE SYMBOLS, so you can
augment MORE RELATIONS from unfamiliar context (SYMBOL
CLOUD, think quantum mechanics and particles) until
you actually GET the pointer to "meaning" and can call on it
(how to relate that
symbol to some other symbol mesh, you can still have
no idea what the hell fermion "means", but you can use it and
fail to be misunderstood unless you want to).
I have no idea how many "syntax errors" there are
in above paragraph - for the reason sublime, my total
lack of knowledge aboot rules of grammar for the
language used to convey meaning heretofore. HTH.
P.S. It makes me wonder, what 't bony "heretofore" word
"means" right now to you, Reader. Compose witty remarks if
it's a-kind funny miss-take, I enjoy my Self when people
smirk. Yes, I did stick-in a word possessing none of its
meaning in my poor head. I must be mad? Or contrary-wise.
I'm not sure, to be frank with you a-like Frank
Herbert iff there was such word in usage "then". She
will compensate for that - any dictionary dug
up shall (she can't help it) explain in detail or else - she always does
that when I go at a genuine miracle in open source. It's
the game we play. I need some time, we make
a beautiful team... Prop me up with another
pill! A-musing...
--
I am the One. I am A vampire A-calling for your love! A.A!
I am the fire that burns within your blood. I am the One!!
No bars or chains can keep me from your bed! I am the One!
Nothing on earth can get me from your head! I am the One!!
Now, that is a wise one - it actually helps
to comprehend my jabber in the other post I
spontaneously generated today...