Most idiomatic way to split a string into sentences by punctuation?

Denis Papathanasiou

Jul 6, 2013, 11:56:40 AM
to clo...@googlegroups.com
I have a plain text file containing an English-language essay I want to split into sentences, based on common punctuation.

I wrote this function, which examines a character and determines if it's an end of sentence punctuation mark:

(defn ispunc? [c]
  (> (count (filter #(= % c) '("." "!" "?" ";"))) 0))

I know this is not grammatically perfect, and that some text such as "U.S.", etc. will be mis-parsed, but this is just an experiment and I don't need that level of precision.

So I loaded my file using slurp and tried using the partition-by function with ispunc? like this:

(def my-text (slurp "mytext.txt"))
(def my-sentences (partition-by ispunc? my-text))

Unfortunately, this returns a sequence of length 1, whose only element is the entire string.

So I tried splitting the string into a list of characters, and applying partition-by with ispunc? like this:

(def my-text-chars (partition (count my-text) my-text))
(def my-sentences (partition-by ispunc? (nth my-text-chars 0)))

This worked, because it is logically correct, but I get a java.lang.OutOfMemoryError when I try to access any of the elements in my-sentences (the plain text "mytext.txt" file is 1.3 MB in size).

So is there a way to do this more idiomatically, without splitting into single chars and recombining?

While 1.3 MB is not small, it's also not so large that it can't be slurped, so there must be a simpler way of splitting on punctuation into sentences.
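
A note on why the first partition-by call yields a single group: iterating over a string produces Character values, and in Clojure a Character never equals a one-character String, so ispunc? as written returns false for every element. A minimal sketch, assuming a set of characters is an acceptable replacement for the list of strings (the names sentence-enders, punc-char?, and char-groups are hypothetical, not from the original post):

```clojure
;; Hypothetical names for illustration; not from the original post.
(def sentence-enders
  "Characters treated as sentence terminators (same four as ispunc?)."
  #{\. \! \? \;})

(defn punc-char?
  "True when c is one of the terminator characters."
  [c]
  (contains? sentence-enders c))

;; A Character is never equal to a one-character String:
(= \. ".")   ; false

(defn char-groups
  "Partitions text into alternating runs of non-terminators and
   terminators, joining each run of characters back into a string."
  [text]
  (map #(apply str %) (partition-by punc-char? text)))

(char-groups "Hi there. Bye!")
```

With this predicate the terminators come back as their own groups, so each one would still have to be rejoined with the text before it.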

Jim - FooBar();

Jul 6, 2013, 1:54:49 PM
to clo...@googlegroups.com
I usually use this regex. It's been a while since I last used it, so I don't remember how it performs...

#"(?<=[.!?]|[.!?][\\'\"])(?<!e\.g\.|i\.e\.|vs\.|p\.m\.|a\.m\.|Mr\.|Mrs\.|Ms\.|St\.|Fig\.|fig\.|Jr\.|Dr\.|Prof\.|Sr\.|[A-Z]\.)\s+"

and as Lars said all you need is clojure.string/split

Jim
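
As a rough illustration of the split approach, here is a shortened variant of the regex above, assuming only three abbreviations need protecting (the names sentence-gap and split-sentences are hypothetical). The first lookbehind requires a terminator just before the whitespace, and the negative lookbehind vetoes the listed abbreviations, so only the whitespace is consumed and the punctuation stays attached to each sentence:

```clojure
(require '[clojure.string :as str])

;; Hypothetical, shortened pattern: splits on whitespace that follows
;; . ! or ? unless the text just before it is Mr. Dr. or e.g.
(def sentence-gap
  #"(?<=[.!?])(?<!Mr\.|Dr\.|e\.g\.)\s+")

(defn split-sentences
  "Splits text into a vector of sentences, keeping end punctuation."
  [text]
  (str/split text sentence-gap))

(split-sentences "Dr. Smith arrived. He was late! Why?")
```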

Denis Papathanasiou

Jul 6, 2013, 5:05:06 PM
to clo...@googlegroups.com


On Saturday, July 6, 2013 1:54:49 PM UTC-4, Jim foo.bar wrote:
I usually use this regex. It's been a while since I last used it, so I don't remember how it performs...

#"(?<=[.!?]|[.!?][\\'\"])(?<!e\.g\.|i\.e\.|vs\.|p\.m\.|a\.m\.|Mr\.|Mrs\.|Ms\.|St\.|Fig\.|fig\.|Jr\.|Dr\.|Prof\.|Sr\.|[A-Z]\.)\s+"

and as Lars said all you need is clojure.string/split

Thanks, though as I replied to Lars, I did want to preserve the actual terminating punctuation, whatever it was, so that's why I'd looked into using partition-by.

Also, sorry for the double post (I didn't realize this group was moderated, so when I didn't see the first post appear, I re-submitted it a little while later). 

Jim - FooBar();

Jul 7, 2013, 6:06:06 AM
to clo...@googlegroups.com
I'm not sure I follow what you mean...both regexes posted here preserve the punctuation...here is mine (ignore the names - it is in fact the same regex):

hotel_nlp.concretions.artefacts=> (pprint (hotel_nlp.protocols/run reg-seg

"Statistics is closely related to probability theory, with which it is often grouped. The difference is, roughly, that probability theory starts from the given parameters of a total population to deduce probabilities that pertain to samples. Statistical inference, however, moves in the opposite direction—inductively inferring from samples to the parameters of a larger or total population!"))

["Statistics is closely related to probability theory, with which it is often grouped."
"The difference is, roughly, that probability theory starts from the given parameters of a total population to deduce probabilities that pertain to samples."
"Statistical inference, however, moves in the opposite direction—inductively inferring from samples to the parameters of a larger or total population!"]

Similar thing happens with Lars's simpler regex...just use 're-seq' instead of 'split'

Jim
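
To make the re-seq suggestion concrete: Lars's regex is not quoted in this thread, so the pattern below is only an assumption about what a simple version might look like (the names sentence-pattern and sentences-with-punctuation are hypothetical). Each match is a run of non-terminator characters followed by one terminator, so the punctuation is part of the match rather than being discarded as a delimiter:

```clojure
(require '[clojure.string :as str])

;; Assumed pattern, not Lars's original: one or more characters that are
;; not terminators, followed by a single terminator.
(def sentence-pattern #"[^.!?;]+[.!?;]")

(defn sentences-with-punctuation
  "Returns a lazy seq of sentences with their end punctuation,
   trimmed of surrounding whitespace."
  [text]
  (map str/trim (re-seq sentence-pattern text)))

(sentences-with-punctuation "One. Two! Three; four?")
```

Any trailing text that has no terminator is not matched by this pattern and is therefore dropped.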

Denis Papathanasiou

Jul 7, 2013, 12:29:10 PM
to clo...@googlegroups.com


On Sunday, July 7, 2013 6:06:06 AM UTC-4, Jim foo.bar wrote:
I'm not sure I follow what you mean...both regexes posted here preserve the punctuation...here is mine (ignore the names - it is in fact the same regex):

You're right; I was actually referring to the suggestions Lars had made. 

[snip]


Similar thing happens with Lars's simpler regex...just use 're-seq' instead of 'split'

That wasn't my experience:

#'user/sentences
user=> (nth sentences 0)
"    THE country of the ancient Mexicans, or Aztecs as they were called, formed but a very small part of the extensive territories comprehended in the modern republic of Mexico"
user=> (nth sentences 1)
" Its boundaries cannot be defined with certainty"
user=> (nth sentences 2)
" They were much enlarged in the latter days of the empire, when they may be considered as reaching from about the eighteenth degree north to the twenty-first on the Atlantic"
 
Actually, I also thought of a way to do it with the simple example suggested by Lars w/o using the nlp package (this only works b/c there are no pipe characters in the text file I'm processing):

user=> (def sentences (clojure.string/split (clojure.string/replace my-text #"([.?!;])\s{1}" "$1|||") #"\|\|\|"))
#'user/sentences
user=> (nth sentences 0)
"    THE country of the ancient Mexicans, or Aztecs as they were called, formed but a very small part of the extensive territories comprehended in the modern republic of Mexico."
user=> (nth sentences 1)
"Its boundaries cannot be defined with certainty."
user=> (nth sentences 2)
"They were much enlarged in the latter days of the empire, when they may be considered as reaching from about the eighteenth degree north to the twenty-first on the Atlantic;"
user=> (nth sentences 3)
"and from the fourteenth to the nineteenth, including a very narrow strip, on the Pacific."
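
The replace-then-split trick above can probably be done in one step, without the ||| sentinel, by splitting on a single whitespace character that is preceded by a terminator. This is a sketch under that assumption (the name split-keeping-punctuation is hypothetical); the lookbehind matches without consuming the punctuation, so it does not matter whether the text contains pipe characters:

```clojure
(require '[clojure.string :as str])

(defn split-keeping-punctuation
  "Splits on one whitespace character that directly follows . ? ! or ;
   leaving the terminator attached to the preceding sentence."
  [text]
  (str/split text #"(?<=[.?!;])\s"))

(split-keeping-punctuation
 "Its boundaries cannot be defined with certainty. They were much enlarged; and from the fourteenth.")
```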

Cedric Greevey

Jul 7, 2013, 12:35:43 PM
to clo...@googlegroups.com
On Sun, Jul 7, 2013 at 12:29 PM, Denis Papathanasiou <denis.pap...@gmail.com> wrote:


[snip]

Some people, when confronted with a problem, think “I know, I'll use regular expressions.”  Now they have two problems. :)