[ANN] Bayesian Classification for Ruby

Lucas Carlson

Apr 11, 2005, 2:53:26 AM

I would like to announce a new module called Classifier for Ruby. It is
available from:

http://rubyforge.org/projects/classifier/

or simply

gem install classifier

With it, you can do things like:

===
require 'classifier'
# supports any number of categories of any name
b = Classifier::Bayes.new 'Interesting', 'Uninteresting'
b.train_interesting "here are some good words. I hope you love them"
b.train_uninteresting "here are some bad words, I hate you"
b.classify "I hate bad words and you" # returns 'Uninteresting'
===

Or if you would like persistence:

===
require 'classifier'
require 'madeleine'
m = SnapshotMadeleine.new("bayes_data") {
  Classifier::Bayes.new 'Interesting', 'Uninteresting'
}
m.system.train_interesting "here are some good words. I hope you love them"
m.system.train_uninteresting "here are some bad words, I hate you"
m.take_snapshot
m.system.classify "I love you" # returns 'Interesting'
===

Please send me any feedback about this library, including how you plan
to use it or extend it.

Thank you!
-Lucas Carlson
http://rufy.com/

Jamis Buck

Apr 11, 2005, 10:11:36 AM

On Apr 11, 2005, at 7:53 AM, Florian Groß wrote:

> Lucas Carlson wrote:
>
>> I would like to announce a new module called Classifier for Ruby.

>> With it, you can do things like:
>> ===
>> require 'classifier'
>> # supports any number of categories of any name
>> b = Classifier::Bayes.new 'Interesting', 'Uninteresting'
>> b.train_interesting "here are some good words. I hope you love them"
>> b.train_uninteresting "here are some bad words, I hate you"
>> b.classify "I hate bad words and you" # returns 'Uninteresting'
>> ===
>

> This is wonderful and might make a nice addition to Rails software
> that already offers manual tagging and/or categorization which is
> quite a common thing to have. Perhaps it would be a good idea to also
> announce it over there.
>
> I don't know if this is already possible, but b.train(:interesting,
> ...) would make an interesting alternative API which would be more
> flexible.
>

+1. I'd like to see a more general #train API as well, but that's a
minor quibble. Thanks, Lucas, for this lib! I've been wanting something
like this for a while now. :)

- Jamis

Florian Groß

Apr 11, 2005, 9:53:27 AM

Lucas Carlson wrote:

> I would like to announce a new module called Classifier for Ruby.
>

> With it, you can do things like:
>
> ===
> require 'classifier'
> # supports any number of categories of any name
> b = Classifier::Bayes.new 'Interesting', 'Uninteresting'
> b.train_interesting "here are some good words. I hope you love them"
> b.train_uninteresting "here are some bad words, I hate you"
> b.classify "I hate bad words and you" # returns 'Uninteresting'
> ===

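This is wonderful and might make a nice addition to Rails software that
already offers manual tagging and/or categorization which is quite a
common thing to have. Perhaps it would be a good idea to also announce
it over there.

I don't know if this is already possible, but b.train(:interesting, ...)
would make an interesting alternative API which would be more flexible.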

Williams, Chris

Apr 11, 2005, 10:47:21 AM

> +1. I'd like to see a more general #train API as well, but that's a
> minor quibble. Thanks, Lucas, for this lib! I've been wanting something
> like this for a while now. :)
>
> - Jamis
>

+1 for me as well. I actually just started writing a Bayesian Classifier
yesterday, so this should save me some work!

Chris

Matt Mower

Apr 11, 2005, 2:06:07 PM

On Apr 11, 2005 7:54 AM, Lucas Carlson <lu...@rufy.com> wrote:
> I would like to announce a new module called Classifier for Ruby. It is
> available from:
>
> http://rubyforge.org/projects/classifier/
>

;-)

I ported the Reverend Bayesian classifier from Python to Ruby over the
weekend. If only I'd waited ;-)

M

--
Matt Mower :: http://matt.blogs.it/

Lucas Carlson

Apr 11, 2005, 3:36:36 PM

Due to popular demand, #train has been added. If you are using gem, try
gem update classifier. Now you can do anything from:

b.train "Interesting", "here are some good words. I hope you love them"

to

b.train :Interesting, "here are some good words. I hope you love them"

Also, lowercase categories and categories with spaces are now supported.

Tom Reilly

Apr 11, 2005, 9:47:26 PM

I happened to notice your posting about classifier.

My problem is this and I wonder if your program would be useful.

I am an MD taking care of nursing home patients. I wrote a database
program to keep track of all of the phone calls we get. We have used the
program for 2 years. We have over 80,000 phone records, each containing
the problem about which the nursing home called and the recommended
treatment.

It occurred to me that, given these messages, there ought to be some way
to classify them by problem type, and that the summary could be used to
determine which problems a given nursing home is not handling very well.

Using Hash.new, I determined that there are about 22,000 distinct words:
some abbreviations, some correct spellings, some incorrect. There are on
average 20 words per message, though many of the words are adjectives,
prepositions, and verbs, which don't help classification.

Using a Levenshtein distance algorithm on the larger words does a pretty
good job of eliminating misspellings, though it works quite poorly on
3-, 4-, and 5-character words.

# Determine the Levenshtein distance of two strings
def ld(s, t)
  n = s.size
  m = t.size

  # If either string is empty, the distance is the length of the other
  return [n, m].max if n == 0 || m == 0

  # Create an (n+1) x (m+1) matrix. Row 0 and column 0 hold the cost of
  # building each prefix from the empty string.
  a = Array.new(n + 1) { Array.new(m + 1, 0) }
  0.upto(n) { |i| a[i][0] = i }
  0.upto(m) { |j| a[0][j] = j }

  1.upto(n) do |i|
    1.upto(m) do |j|
      # The matrix is 1-indexed, the strings are 0-indexed
      cost = s[i - 1] == t[j - 1] ? 0 : 1
      deletion     = a[i - 1][j] + 1
      insertion    = a[i][j - 1] + 1
      substitution = a[i - 1][j - 1] + cost
      a[i][j] = [deletion, insertion, substitution].min
    end
  end
  a[n][m]
end
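
One possible way to handle the short-word problem described above is to scale the number of allowed edits with word length, so that short words must match exactly. The following is only a sketch under that assumption: `levenshtein` is a standard two-row implementation, while `same_word?` and its length thresholds (5 and 8) are hypothetical and would need tuning against the real phone records.

```ruby
# Standard dynamic-programming Levenshtein distance, keeping only two rows.
def levenshtein(s, t)
  return t.size if s.empty?
  return s.size if t.empty?

  prev = (0..t.size).to_a
  s.each_char.with_index(1) do |sc, i|
    curr = [i]
    t.each_char.with_index(1) do |tc, j|
      cost = sc == tc ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical rule: the shorter the word, the fewer edits are tolerated.
# The thresholds are arbitrary assumptions, not measured values.
def same_word?(a, b)
  shorter = [a.size, b.size].min
  allowed = if shorter <= 5
              0   # short words must match exactly
            elsif shorter <= 8
              1
            else
              2
            end
  levenshtein(a, b) <= allowed
end

same_word?("pneumonia", "pnuemonia")  # long word, two edits tolerated
same_word?("fall", "fell")            # short word, treated as different
```

With a rule like this, "fall" and "fell" stay separate words, while a transposition in a long word such as "pneumonia" is still folded into the correct spelling.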

I'd appreciate any comments you might have.

Thanks

Tom Reilly.

Bob Aman

Apr 11, 2005, 10:05:11 PM

> Please send me any feedback about this library, including how you plan
> to use it or extend it.

I think I'm in love. Not sure what I'll do with it yet, but I'm sure
I'll dream something up!
--
Bob Aman

David Garamond

Apr 11, 2005, 10:09:06 PM

I'd even suggest removing the individual #train_... methods. It makes
the API simpler, and how many characters do they save anyway? Plus
consider these use cases: 1) category names are changed; 2) category
names contain whitespace, etc.; 3) there are 1000+ categories.

--
dave

Glenn Parker

Apr 11, 2005, 10:59:06 PM

David Garamond wrote:
>
> I'd even suggest removing the individual #train_... methods.

+1. BTW, I think this is a nifty tool.

--
Glenn Parker | glenn.parker-AT-comcast.net | <http://www.tetrafoil.com/>

"Peña, Botp"

Apr 11, 2005, 11:32:01 PM

Matt Mower [mailto:matt....@gmail.com] wrote:

#> http://rubyforge.org/projects/classifier/
#>
#
#;-)
#
#I ported the Reverend bayesian classifier from Python to Ruby
#over the weekend. If only I'd waited ;-)

Did you finish the port? Are you using it?

I'm asking since I'm using Outlook SpamBayes, which is pure Python.

kind regards -botp

#
#M
#
#--
#Matt Mower :: http://matt.blogs.it/
#

Lucas Carlson

Apr 12, 2005, 12:59:03 AM

> I'd even suggest removing the individual #train_... methods. It makes
> the API simpler, and how many characters do they save anyway. Plus
> consider these use cases: 1) category names are changed; 2) name of
> categories contain whitespaces, etc; 3) there are 1000+ categories.

1) Category names can't change, but even if they could, this is
implemented via method_missing
2) I have elegantly handled white spaces in category names
3) This is implemented via method_missing, not define_method, so
objects don't get bloated
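
For readers unfamiliar with the technique, here is a minimal sketch of how `train_<category>` calls can be routed to a general `#train` through `method_missing`. This is not the Classifier gem's source: the class `SketchTrainer`, its `counts` reader, and the `normalize` helper are hypothetical names used only for illustration, and the way spaces and case are folded is an assumption.

```ruby
# Hypothetical stand-in, NOT the Classifier gem's actual implementation.
class SketchTrainer
  attr_reader :counts

  def initialize(*categories)
    # normalized category key => { word => count }
    @counts = {}
    categories.each { |c| @counts[normalize(c)] = Hash.new(0) }
  end

  # General entry point: the category may be a String or a Symbol.
  def train(category, text)
    key = normalize(category)
    raise ArgumentError, "unknown category: #{category}" unless @counts.key?(key)
    text.downcase.scan(/\w+/) { |word| @counts[key][word] += 1 }
    self
  end

  # Route train_<category> to #train without defining one method per
  # category, so 1000+ categories add no methods to the object.
  def method_missing(name, *args, &block)
    match = name.to_s.match(/\Atrain_(.+)\z/)
    if match && @counts.key?(normalize(match[1]))
      train(match[1], *args)
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    match = name.to_s.match(/\Atrain_(.+)\z/)
    (match && @counts.key?(normalize(match[1]))) || super
  end

  private

  # "Not Interesting", :not_interesting and "not interesting" share one key.
  def normalize(category)
    category.to_s.downcase.gsub(/[\s_]+/, "_")
  end
end

t = SketchTrainer.new("Interesting", "Not Interesting")
t.train_interesting "here are some good words"
t.train :Interesting, "more good words"
t.train_not_interesting "here are some bad words"
```

Because the dispatch happens at call time, nothing is generated per category, and an unknown name such as `train_bogus` still raises NoMethodError through `super`.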

Matt Mower

Apr 12, 2005, 3:47:47 AM

On Apr 12, 2005 4:32 AM, "Peña, Botp" <bo...@delmonte-phil.com> wrote:
> Matt Mower [mailto:matt....@gmail.com] wrote:
>
> #> http://rubyforge.org/projects/classifier/
> #>
> #
> #;-)
> #
> #I ported the Reverend bayesian classifier from Python to Ruby
> #over the weekend. If only I'd waited ;-)
>
> did you finished the port? Are you using it?
>

Yes, I finished porting the code, and it seems to work for the Robinson
method. What I haven't done yet is verify that it produces the same
results as Reverend, i.e. that I have ported it properly. The
Robinson-Fisher method doesn't seem to work, so I think I need to go
back and look at my implementation of that algorithm again.

Here is Lucas' example using Bishop:

require 'bishop'

# The block passed here actually mimics the default behaviour; however,
# I add it here to show how the probability combiner algorithm can be
# replaced, e.g. with Bishop::robinson_fisher
b = Bishop::Bayes.new { |probs,ignore| Bishop::robinson( probs, ignore ) }

b.train( "interesting", "here are some good words. I hope you love them" )
b.train( "uninteresting", "here are some bad words, I hate you" )
b.guess( "I hate bad words and you" ).each { |c| puts c.join( " = " ) }

outputs

uninteresting = 0.9999

Is there anyone with some Python savvy who'd like to collaborate on testing?

Florian Groß

Apr 12, 2005, 7:47:03 AM

Tom Reilly wrote:

> Using Hash.new, I determined that there are about 22,000 words some
> abbreviations, some correct spellings, some others incorrect.. There
> are on the average of 20 words per message though many of the words
> are adjitives, prepositions, verbs which don't help classifications.

Regarding spelling mistakes: Given enough overlap between the correct
and incorrect word, that will not be a problem. Over the course of time,
the Thunderbird spam filter has learned to deal with the deliberate
misspellings and obfuscations used by spam senders. I think it works
like this:

Spam Message A: Deve|oped Commercia|ized Price
Spam Message B: Pr1ce Commercia|ized
Spam Message C: Developed Commercialized Pr1ce
Spam Message D: Developed Commercialized Price

It will see that there is quite some overlap between those messages and
when it classifies one as spam it will also learn new data from that
message which will make it adapt given enough data.

It is, however, a good idea to examine a good number of its
classification results and to correct them manually if necessary.
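
To make the adaptation idea concrete, here is a minimal naive Bayes sketch with add-one smoothing. It is neither Thunderbird's filter nor the Classifier gem: the class `TinyBayes` and its methods are hypothetical, and the training messages are made up along the lines of the examples above. The point is only that an obfuscated token such as "Pr1ce" becomes spam evidence once it has appeared in messages trained as spam.

```ruby
# Hypothetical minimal naive Bayes, for illustration only.
class TinyBayes
  def initialize(*categories)
    @word_counts = {}            # category => { word => count }
    @doc_counts  = Hash.new(0)   # category => number of training messages
    categories.each { |c| @word_counts[c] = Hash.new(0) }
  end

  def train(category, text)
    tokens(text).each { |w| @word_counts.fetch(category)[w] += 1 }
    @doc_counts[category] += 1
    self
  end

  def classify(text)
    vocab      = @word_counts.values.flat_map(&:keys).uniq.size
    total_docs = @doc_counts.values.sum.to_f
    scores = @word_counts.map do |category, counts|
      total_words = counts.values.sum
      # log prior plus summed log likelihoods, both with add-one smoothing
      score = Math.log((@doc_counts[category] + 1) / (total_docs + @word_counts.size))
      tokens(text).each do |w|
        score += Math.log((counts[w] + 1.0) / (total_words + vocab + 1))
      end
      [category, score]
    end
    scores.max_by { |_, score| score }.first
  end

  private

  # Split on whitespace only, so "Deve|oped" and "Pr1ce" stay whole tokens.
  def tokens(text)
    text.downcase.split(/\s+/).reject(&:empty?)
  end
end

b = TinyBayes.new(:spam, :ham)
b.train :spam, "Deve|oped Commercia|ized Price"
b.train :spam, "Pr1ce Commercia|ized"
b.train :ham,  "meeting notes for the price review"

guess = b.classify("Pr1ce Deve|oped")
# Feeding the classified message back in is what lets the filter adapt.
b.train guess, "Pr1ce Deve|oped"
```

Retraining on each classified message is also why the manual review matters: a wrong guess that is fed back in reinforces the mistake.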

Andreas Schwarz

Apr 12, 2005, 8:00:28 AM

Lucas Carlson wrote:

> Please send me any feedback about this library, including how you plan
> to use it or extend it.

A suggestion: it would be nice if you could use arrays of strings or
symbols to train/classify data.

Martin Ankerl

Apr 12, 2005, 9:46:10 AM

Thanks a lot!

martinus

Lucas Carlson

Apr 12, 2005, 6:18:55 PM

some_array.each { |str| b.train_interesting str }

Dave Brown

Apr 13, 2005, 10:04:13 AM

"Lucas Carlson" <lu...@rufy.com> writes:
> I would like to announce a new module called Classifier for Ruby. It is
> available from:
>
> http://rubyforge.org/projects/classifier/
>
> or simply
>
> gem install classifier

Now this looks interesting. I'll have to have a look at it and
maybe update the gurgitate-mail documentation to mention it.

--Dave