Best way to count frequency of words in a paragraph?


Ken McDonald

Mar 18, 2012, 4:16:40 PM
to scala...@googlegroups.com
I can't seem to wrap my mind around the problem of counting how many times a word occurs in a paragraph. That is, roughly, a function Seq[String] => Map[String, Int]. I feel like it should be a one liner, but all the "obvious" solutions to me are considerably longer. Thanks for any advice.

Thanks,
Ken

HamsterofDeath

Mar 18, 2012, 4:25:46 PM
to scala...@googlegroups.com
from brain to keyboard:
string.split(" ").groupBy(e => e).map(e => e._1 -> e._2.length)

Luke Vilnis

Mar 18, 2012, 5:08:36 PM
to HamsterofDeath, scala...@googlegroups.com
FYI, you can also use mapValues(_.length) 

ps. funny, I was just doing this exact thing. are you doing NLP?
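
For reference, the two suggestions so far combine into something like the following. This is only a minimal sketch: `WordCount` and `countWords` are names made up for the example, and it uses a plain `map` in place of `mapValues`, since `mapValues` on a `Map` is deprecated as of Scala 2.13.

```scala
object WordCount {
  // Splits on single spaces (as in the one-liner above) and counts each token.
  def countWords(text: String): Map[String, Int] =
    text
      .split(" ")
      .toSeq
      .groupBy(identity) // word -> all of its occurrences
      .map { case (word, occurrences) => word -> occurrences.length }
}
```

For example, `WordCount.countWords("the cat and the hat")` maps "the" to 2 and every other word to 1.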

Michael Schmitz

Mar 18, 2012, 6:19:59 PM
to Luke Vilnis, HamsterofDeath, scala...@googlegroups.com
I'd love it if Scala had a multiset/bag implementation.
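
Lacking a built-in bag, an immutable `Map[String, Int]` with a default of zero gets reasonably close. The sketch below is one possible approximation, not a library API; `Bag` and `fromWords` are hypothetical names.

```scala
object Bag {
  // Builds a "bag" as a Map from element to count; absent elements count as 0.
  def fromWords(words: Seq[String]): Map[String, Int] =
    words.foldLeft(Map.empty[String, Int].withDefaultValue(0)) { (counts, word) =>
      counts.updated(word, counts(word) + 1)
    }
}
```

Looking up a word that never occurred returns 0 instead of throwing, which is the main convenience a bag would offer here.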

Peace. Michael

Rex Kerr

Mar 18, 2012, 6:38:46 PM
to HamsterofDeath, scala...@googlegroups.com
Probably want split(" +") so that you don't count "Hi.  Bye." as three words.  Or you could filter out things that aren't words somehow.  (string.split(" ").filter(_ => ???)....)
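
To make the difference concrete, here is a small sketch (the object and value names are invented for the example) comparing the two split patterns on that same input:

```scala
object SplitOnSpaces {
  val text = "Hi.  Bye." // two spaces between the sentences

  // Splitting on a single space leaves an empty string between the two spaces.
  val single: List[String] = text.split(" ").toList

  // Splitting on one or more spaces avoids the empty token.
  val oneOrMore: List[String] = text.split(" +").toList
}
```

Here `single` has three elements, the middle one being the empty string, while `oneOrMore` has only the two real tokens.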

  --Rex

On Sun, Mar 18, 2012 at 4:25 PM, HamsterofDeath <h-s...@gmx.de> wrote:

Daniel Sobral

Mar 18, 2012, 10:45:44 PM
to HamsterofDeath, scala...@googlegroups.com
On Sun, Mar 18, 2012 at 17:25, HamsterofDeath <h-s...@gmx.de> wrote:
> from brain to keyboard:
> string.split(" ").groupBy(e => e).map(e => e._1 -> e._2.length)

string.split("\\W+").groupBy(identity).mapValues(_.length)

Though, perhaps, \P{Alpha}+ would be better than \W+, as it would get
rid of numbers. It really comes down to what exactly you want. When
it comes to regex, you really should be precise in what you mean.
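
A quick sketch of how the two patterns differ, with `Tokenizers` and its methods being hypothetical names. The `filter(_.nonEmpty)` is an addition of mine: a separator at the very start of the input would otherwise leave an empty first token. Note also that in java.util.regex, `\p{Alpha}` covers only ASCII letters by default.

```scala
object Tokenizers {
  // \W+ splits on runs of non-word characters; digits count as word characters.
  def splitNonWord(text: String): List[String] =
    text.split("\\W+").toList.filter(_.nonEmpty)

  // \P{Alpha}+ splits on runs of anything that is not an ASCII letter,
  // so digits end up in the separators and disappear.
  def splitNonAlpha(text: String): List[String] =
    text.split("\\P{Alpha}+").toList.filter(_.nonEmpty)
}
```

On "I have 2 cats, and 2 dogs." the first version keeps both "2" tokens and the second drops them.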

>
> On 18.03.2012 21:16, Ken McDonald wrote:
>> I can't seem to wrap my mind around the problem of counting how many
>> times a word occurs in a paragraph. That is, roughly, a function
>> Seq[String] => Map[String, Int]. I feel like it should be a one liner,
>> but all the "obvious" solutions to me are considerably longer. Thanks
>> for any advice.
>>
>> Thanks,
>> Ken
>

--
Daniel C. Sobral

I travel to the future all the time.

Dennis Haupt

Mar 19, 2012, 4:50:13 AM
to Luke Vilnis, scala...@googlegroups.com
NLP?

-------- Original Message --------
> Date: Sun, 18 Mar 2012 17:08:36 -0400
> From: Luke Vilnis <lvi...@gmail.com>
> To: HamsterofDeath <h-s...@gmx.de>
> CC: scala...@googlegroups.com
> Subject: Re: [scala-user] Best way to count frequency of words in a paragraph?

Kevin Wright

Mar 19, 2012, 5:39:06 AM
to Dennis Haupt, Luke Vilnis, scala...@googlegroups.com
Natural Language Processing

Ken McDonald

Mar 22, 2012, 12:07:54 PM
to scala...@googlegroups.com, HamsterofDeath
Thanks, everyone, for the replies. You even designed my regex for me, which is above and beyond the call of duty!

Ken

Ken McDonald

Mar 22, 2012, 12:12:12 PM
to scala...@googlegroups.com, HamsterofDeath
Not in the way I suspect you mean it. I'm trying to prototype an idea I have for helping people to expand their vocabulary of a language they're learning. This is only NLP if you consider a Lada to have been a car.

Michael Schmitz

Mar 22, 2012, 6:43:20 PM
to Ken McDonald, scala...@googlegroups.com, HamsterofDeath
You'd be surprised what qualifies for NLP.

hereins...@gmx.de

Mar 23, 2012, 11:03:42 AM
to scala...@googlegroups.com
string.split("\\P{Alpha}+").groupBy(identity).mapValues(_.length) is great, but how can I also match words like we'll, I'd, wouldn't, etc.?


-------- Original Message --------
> Date: Thu, 22 Mar 2012 09:07:54 -0700 (PDT)
> From: Ken McDonald <ykke...@gmail.com>
> To: scala...@googlegroups.com
> CC: HamsterofDeath <h-s...@gmx.de>
> Subject: Re: [scala-user] Best way to count frequency of words in a paragraph?

> Thanks, everyone, for the replies. You even designed my regex for me,


√iktor Ҡlang

Mar 23, 2012, 11:13:15 AM
to hereins...@gmx.de, scala...@googlegroups.com
http://docs.oracle.com/javase/tutorial/essential/regex/
--
Viktor Klang

Akka Tech Lead
Typesafe - The software stack for applications that scale

Twitter: @viktorklang

Edmondo Porcu

Mar 23, 2012, 11:15:43 AM
to √iktor Ҡlang, hereins...@gmx.de, scala...@googlegroups.com
Regexps have historically been one of the most tedious parts to learn in
programming.

We are looking for a volunteer happy to code a DSL in Scala to create regexps :))

Edmondo

2012/3/23 √iktor Ҡlang <viktor...@gmail.com>:

√iktor Ҡlang

Mar 23, 2012, 11:22:24 AM
to Edmondo Porcu, hereins...@gmx.de, scala...@googlegroups.com


2012/3/23 Edmondo Porcu <edmond...@gmail.com>

Regexps have historically been one of the most tedious parts to learn in
programming.

Learning is what a programmer does.
 

We are looking for a volunteer happy to code a DSL in Scala to create regexps :))

Good luck with that ;-)

Cheers,

Derek Williams

Mar 23, 2012, 11:34:44 AM
to Edmondo Porcu, √iktor Ҡlang, hereins...@gmx.de, scala...@googlegroups.com
2012/3/23 Edmondo Porcu <edmond...@gmail.com>

We are looking for a volunteer happy to code a DSL in Scala to create regexps :))


Not that it would help Ken out... since I'm sure he's quite aware of it :)

--
Derek Williams

virtualeyes

Mar 23, 2012, 12:17:13 PM
to scala-user
Neuro-Linguistic Programming, AKA, NLP

On Mar 19, 10:39 am, Kevin Wright <kev.lee.wri...@gmail.com> wrote:
> Natural Language Processing
>
> On 19 March 2012 08:50, Dennis Haupt <h-s...@gmx.de> wrote:
>
> > NLP?
>
> > -------- Original Message --------
> > > Date: Sun, 18 Mar 2012 17:08:36 -0400
> > > From: Luke Vilnis <lvil...@gmail.com>

Luke Vilnis

Mar 23, 2012, 1:08:06 PM
to virtualeyes, scala-user

hereins...@gmx.de

Mar 23, 2012, 3:35:34 PM
to "√iktor Ҡlang", scala...@googlegroups.com
Thanks for the link. After reading it I tried the following but it does not give me the expected result. Can anyone elaborate on what is the problem?

('?\\P{Alpha}+)|(\\P{Alpha}+'?\\P{Alpha}+)|(\\P{Alpha}+'?)|(\\P{Alpha}+)

-------- Original Message --------
> Date: Fri, 23 Mar 2012 16:13:15 +0100
> From: "√iktor Ҡlang" <viktor...@gmail.com>
> CC: scala...@googlegroups.com

> Viktor Klang
>
> Akka Tech Lead

> Typesafe <http://www.typesafe.com/> - The software stack for applications
> that scale
>
> Twitter: @viktorklang

Daniel Sobral

Mar 23, 2012, 5:54:13 PM
to Edmondo Porcu, √iktor Ҡlang, hereins...@gmx.de, scala...@googlegroups.com
2012/3/23 Edmondo Porcu <edmond...@gmail.com>:

> Regexps have historically been one of the most tedious parts to learn in
> programming.
>
> We are looking for a volunteer happy to code a DSL in Scala to create regexps :))

https://github.com/KenMcDonald/rex

Daniel Sobral

Mar 23, 2012, 6:02:49 PM
to hereins...@gmx.de, scala...@googlegroups.com
On Fri, Mar 23, 2012 at 12:03, <hereins...@gmx.de> wrote:
> string.split("\\P{Alpha}+").groupBy(identity).mapValues(_.length) is great, but how can I also match words like we'll, I'd, wouldn't, etc.?

That's harder. You could use [^A-Za-z'], which would catch these
words, but would also pick up ' used in other contexts. It also ignores
non-ASCII alphabetic characters. It might be possible to come up with
some other split pattern, but I'm guessing it would be annoyingly
hard. The whole point of using \P{Alpha} is to pick everything that
is *NOT* alphabetic, which mirrors what split is doing: identifying
everything that is not what you want.
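
A rough sketch of that trade-off, with `ApostropheSplit` as a made-up name and the `+` quantifier plus `filter(_.nonEmpty)` added by me to avoid empty tokens:

```scala
object ApostropheSplit {
  // Treats every run of characters other than ASCII letters and apostrophes
  // as a separator.
  def words(text: String): List[String] =
    text.split("[^A-Za-z']+").toList.filter(_.nonEmpty)
}
```

Contractions such as "We'll" survive intact, but quotation-style apostrophes stay attached to the word, and a non-ASCII letter acts as a separator, so "café" comes out as "caf".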

A better solution would be to use findAllIn with a pattern that
describes what words look like, in which case we could have
something like (\p{Alpha}(('\p{Alpha})|\p{Alpha})*).
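
As a sketch of the findAllIn approach, assuming the pattern above with the groups made non-capturing (the names `WordFinder`, `words` and `frequencies` are invented for the example):

```scala
object WordFinder {
  // A letter, followed by any number of letters or apostrophe-plus-letter pairs.
  private val Word = """\p{Alpha}(?:'\p{Alpha}|\p{Alpha})*""".r

  // Returns the words themselves instead of what lies between separators.
  def words(text: String): List[String] = Word.findAllIn(text).toList

  def frequencies(text: String): Map[String, Int] =
    words(text).groupBy(identity).map { case (w, ws) => w -> ws.length }
}
```

Because an apostrophe only counts when a letter follows it, "wouldn't" is kept whole while the quotes around 'hello' are dropped.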

hereins...@gmx.de

Mar 23, 2012, 6:36:49 PM
to Daniel Sobral, scala...@googlegroups.com
[^A-Za-z'] works well enough for the moment. But I'm somewhat confused now. According to http://docs.oracle.com/javase/tutorial/essential/regex/char_classes.html:
The "^" metacharacter means negation. So, how is it possible then to get a match with this regex?

I did some testing on the REPL:
"bat".split("[^bcr]at")
res165: Array[java.lang.String] = Array(bat)

"cat".split("[^bcr]at")
res166: Array[java.lang.String] = Array(cat)

"hat".split("[^bcr]at")
res167: Array[java.lang.String] = Array()

Well, this is quite the opposite behavior to what is stated on Java's regex tutorial website:

They say: "To match all characters except those listed, insert the "^" metacharacter at the beginning of the character class. This technique is known as negation.

Enter your regex: [^bcr]at
Enter input string to search: bat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: cat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: hat
I found the text "hat" starting at index 0 and ending at index 3."

So, what am I missing here? Can someone enlighten me?


-------- Original Message --------
> Date: Fri, 23 Mar 2012 19:02:49 -0300
> From: Daniel Sobral <dcso...@gmail.com>
> CC: scala...@googlegroups.com

Jean-Francois Im

Mar 23, 2012, 6:45:21 PM
to hereins...@gmx.de, Daniel Sobral, scala...@googlegroups.com
No, it behaves as expected. You're doing a string split based on a
regular expression.

Your first and second ones do not match anything, so the string itself
is returned. The third one matches everything in the string, so there
is nothing left.

Try something like "bathatcat".split("[^bcr]at")
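
Spelled out as a small sketch (the object and value names are only for illustration), the cases look like this:

```scala
object SplitDemo {
  private val pattern = "[^bcr]at"

  // No match in "bat": nothing is removed, so the input comes back whole.
  val bat: List[String] = "bat".split(pattern).toList

  // "hat" matches entirely, so nothing is left over.
  val hat: List[String] = "hat".split(pattern).toList

  // Only the middle "hat" matches; the pieces on either side remain.
  val mixed: List[String] = "bathatcat".split(pattern).toList
}
```

So `mixed` contains "bat" and "cat", the two pieces the pattern did not match.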

Daniel Sobral

Mar 24, 2012, 10:08:07 AM
to hereins...@gmx.de, scala...@googlegroups.com
On Fri, Mar 23, 2012 at 19:36, <hereins...@gmx.de> wrote:
> [^A-Za-z'] works well enough for the moment. But I'm somewhat confused now. According to http://docs.oracle.com/javase/tutorial/essential/regex/char_classes.html:
> The "^" metacharacter means negation. So, how is it possible then to get a match with this regex?
>
> I did some testing on the REPL:
> "bat".split("[^bcr]at")
> res165: Array[java.lang.String] = Array(bat)
>
> "cat".split("[^bcr]at")
> res166: Array[java.lang.String] = Array(cat)
>
> "hat".split("[^bcr]at")
> res167: Array[java.lang.String] = Array()
>
> Well, this is quite the opposite behavior to what is stated on Java's regex tutorial website:
>
> They say: "To match all characters except those listed, insert the "^" metacharacter at the beginning of the character class. This technique is known as negation.
>
> Enter your regex: [^bcr]at
> Enter input string to search: bat
> No match found.
>
> Enter your regex: [^bcr]at
> Enter input string to search: cat
> No match found.
>
> Enter your regex: [^bcr]at
> Enter input string to search: hat
> I found the text "hat" starting at index 0 and ending at index 3."
>
> So, what am I missing here? Can someone enlighten me?

You are using split. Split doesn't return the matching strings (that's
what the findAllIn suggestion would do). Split *removes* all matching
strings, and returns an array of the remaining strings, broken up at
the point where the removal happens.
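
Side by side, the two operations on the same pattern might look like this; `MatchVersusSplit`, `matches` and `remainder` are hypothetical names used only for the sketch:

```scala
object MatchVersusSplit {
  private val pattern = "[^bcr]at"
  private val NotBcrAt = pattern.r

  // findAllIn returns the substrings that match the pattern.
  def matches(text: String): List[String] = NotBcrAt.findAllIn(text).toList

  // split returns what is left once the matches are cut out.
  def remainder(text: String): List[String] = text.split(pattern).toList
}
```

This lines up with the tutorial output: `matches` finds "hat" and nothing in "bat" or "cat", while `remainder` reports the complement.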

hereins...@gmx.de

Mar 28, 2012, 7:55:35 AM
to scala...@googlegroups.com
Hi,

I'm playing around with Actors and having problems figuring out why method fetch is running only one time. Does anyone know why?

Here is the code...

import scala.xml._
import XML._
import scala.actors._
import Actor._
import java.net.URL

object RssFetch extends App {

  // Prints every title it receives.
  val messenger = actor {
    loop {
      react {
        case Response(msg: String) => println(msg)
      }
    }
  }

  val rssFetcher = new Fetcher(messenger)
  rssFetcher.start
}

case class FetchFeeds()
case class InitFeeder()
case class Response(title: String)

class Fetcher(messenger: Actor) extends Actor {

  this ! InitFeeder

  // Starts a helper actor that asks this actor to fetch every 3 seconds.
  private def periodicFetch() {
    val feeder = self
    actor {
      loop {
        println("Starting periodic fetching...")
        Thread.sleep(3000)
        feeder ! FetchFeeds
      }
    }
  }

  def act {
    loop {
      react {
        case FetchFeeds => fetch()
        case InitFeeder => periodicFetch()
      }
    }
  }

  // Loads the feed and sends each item title to the messenger.
  def fetch(): Unit = {
    val rssFeed = XML.load(new URL("http://www.google.com/news?pz=1&cf=all&ned=us&hl=en&output=rss").openConnection.getInputStream)
    val items = rssFeed \ "channel" \ "item"
    for {
      title <- items \ "title"
    } messenger ! Response(title.text)
  }
}
