I've recently released "Bishop", a Ruby port of the "Reverend" Bayesian classifier written in Python. Bishop-0.3.0 is available as a Gem and from RubyForge.
Bishop is a reasonably direct port of the original Python code. Bug reports and suggestions for improving the structure of the code would be welcome.
Bishop includes both the Robinson and Robinson-Fisher algorithms for classification. I am presuming they were implemented correctly in Reverend, and I aim to verify this through my own use of the code.
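For readers unfamiliar with the Robinson method, here is a minimal sketch of its combining step as it is commonly described: per-word probabilities are merged using geometric means. The function name `robinson_combine` is made up for this illustration, and the code is not taken from Bishop or Reverend, so treat it as an assumption about the general technique rather than a description of either library. The Robinson-Fisher variant is not shown.

```ruby
# Sketch of Robinson's geometric-mean combining step (illustrative only,
# not Bishop's source). `probs` holds one probability per word, each
# between 0.0 and 1.0.
def robinson_combine(probs)
  raise ArgumentError, "need at least one probability" if probs.empty?

  nth = 1.0 / probs.size
  product_of_complements = probs.reduce(1.0) { |acc, p| acc * (1.0 - p) }
  product_of_probs       = probs.reduce(1.0) { |acc, p| acc * p }

  # P grows when the words point towards the pool, Q when they point away.
  p_score = 1.0 - product_of_complements**nth
  q_score = 1.0 - product_of_probs**nth

  # Map the normalised difference from [-1, 1] onto [0, 1].
  s = (p_score - q_score) / (p_score + q_score)
  (1.0 + s) / 2.0
end
```

With evenly balanced evidence, for example `robinson_combine([0.9, 0.1])`, the score comes out at roughly 0.5, while two words that both score 0.9 give roughly 0.9.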
Support is included for saving/loading the trained classifier to/from YAML.
An example of using Bishop:
require 'bishop'
b = Bishop::Bayes.new
b.train( "ham", "a great message from a close friend" )
b.train( "spam", "buy viagra here" )
puts b.guess( "would a friend send you a viagra advert?" )
=> [ [ "ham", <prob> ], [ "spam", <prob> ] ]
Bishop defaults to using the Robinson algorithm. To use a different algorithm, construct the classifier with a block that calls the chosen algorithm:
#I've recently released a Ruby port "Bishop" of the "Reverend"
#bayesian classifier written in Python. Bishop-0.3.0 is
#available as a Gem and from RubyForge
#
# http://rubyforge.org/projects/bishop/
hmmm, another cool filter. very small, took me less than 5 seconds to install the gem remotely.
btw, matt, how difficult or easy is it to port the bishop database to a db like postgres? I am asking since i may be querying/archiving more than 10_000 entries...
> #I've recently released a Ruby port "Bishop" of the "Reverend"
> #bayesian classifier written in Python. Bishop-0.3.0 is
> #available as a Gem and from RubyForge
> #
> # http://rubyforge.org/projects/bishop/
> hmmm, another cool filter. very small, took me less than 5 seconds to
> install the gem remotely.
> btw, matt, how difficult or easy is it to port the bishop database to a db
> like postgres? I am asking since i may be querying/archiving more than
> 10_000 entries...
This is an excellent question. I want to use the classifier within a Rails based information aggregator I am writing to allow classification of interesting/uninteresting information and perhaps for automatic labelling.
The problem I have is that the classifier will need to be available for every request that classifies an item and every request that sorts items, i.e. quite often. This probably means initializing it once and storing it in a session variable. Since there is no concept of a session expiry callback (Rails is not an app server), the question is "How do I checkpoint the classifier as it is trained?"
At the moment I can serialize it to YAML but that takes a little time and will get slower as the training set increases. Doing the YAML conversion and a SQL update on each request is prohibitive.
I've been considering whether to have the classifier exist in a separate thread or process and allow it to checkpoint itself automatically at intervals, independent of the user's session behaviour. Another option is to convert the code so that everything (or nearly everything) operates directly out of the database.
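One simple variant of the checkpointing idea can be sketched as follows: rather than writing YAML on every request, write it only after a set number of training calls. This counts calls instead of using a timer or a separate thread, so it is a simplification of what is described above. The class `IntervalCheckpointer` and its methods are hypothetical names invented for this illustration and are not part of Bishop; the block passed to the constructor stands in for whatever returns the classifier's serializable state.

```ruby
require 'yaml'

# Hypothetical helper, not part of Bishop: writes a YAML snapshot only
# after every `every` recorded training calls.
class IntervalCheckpointer
  attr_reader :writes

  # The block must return plain data (hashes, arrays, strings, numbers)
  # describing the classifier's current state.
  def initialize(path, every:, &state_source)
    @path = path
    @every = every
    @state_source = state_source
    @pending = 0
    @writes = 0
  end

  # Call once after each training operation.
  def record_training
    @pending += 1
    flush if @pending >= @every
  end

  # Write the snapshot now, but only if something changed since last time.
  def flush
    return if @pending.zero?

    File.write(@path, YAML.dump(@state_source.call))
    @pending = 0
    @writes += 1
  end
end
```

A caller would invoke `record_training` after each training call and `flush` on shutdown. The trade-off is that up to `every - 1` training calls can be lost if the process dies between snapshots.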
Representing the pools and training data via SQL would be simple enough since it's just (word,count) tuples. Basing the code on a SQL variant might be quite attractive. The issue would be making it portable.
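To make the (word,count) idea concrete, here is a small sketch that flattens nested pool data into (pool, word, count) rows and rebuilds it again. The nested-hash layout, the function names, and the table definition in the comment are assumptions made for illustration; they are not Bishop's internal structures.

```ruby
# A table along these lines could hold the rows (hypothetical schema):
#   CREATE TABLE pool_words (
#     pool  VARCHAR(255) NOT NULL,
#     word  VARCHAR(255) NOT NULL,
#     count INTEGER      NOT NULL,
#     PRIMARY KEY (pool, word)
#   );

# Flatten { pool => { word => count } } into [pool, word, count] rows.
def pools_to_rows(pools)
  pools.flat_map do |pool, words|
    words.map { |word, count| [pool, word, count] }
  end
end

# Rebuild the nested hash from rows, summing counts for repeated keys.
def rows_to_pools(rows)
  rows.each_with_object({}) do |(pool, word, count), pools|
    (pools[pool] ||= Hash.new(0))[word] += count
  end
end
```

Because each row is independent, training a single word becomes an update to one row instead of a rewrite of the whole serialized classifier.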
Since I'm using Rails anyway, I could certainly attempt an ActiveRecord-based variant, which should also satisfy the Postgres requirement.
Could be an interesting experiment. What do you think?
#Since I'm using Rails anyway I could certainly attempt an
#ActiveRecord based variant which should satisfy the Postgres
#requirement also.
#
#Could be an interesting experiment. What do you think?
interesting indeed, and useful too. thanks and kind regards -botp
Would the method_missing syntax be easy to add to Bishop? Would untrain be easy to add to projects/classifier? From what I've looked at them so far, sounds like the answer to both would be yes. If they had the same API, they could go in the same module so that swapping filter types would be as simple as changing the Classifier::XXX.new line.
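Assuming the method_missing syntax means calls of the form `train_<pool>(text)`, it can be layered on top of any object that exposes `train(pool, text)` and `untrain(pool, text)`. The wrapper below is a hypothetical sketch; `DynamicTrainer` is an invented name and is not part of either library.

```ruby
# Hypothetical wrapper: forwards train_<pool>(text) and untrain_<pool>(text)
# to train(pool, text) and untrain(pool, text) on the wrapped classifier.
class DynamicTrainer
  PATTERN = /\A(train|untrain)_(\w+)\z/

  def initialize(classifier)
    @classifier = classifier
  end

  def method_missing(name, *args, &block)
    match = PATTERN.match(name.to_s)
    return super unless match

    # match[1] is "train" or "untrain", match[2] is the pool name.
    @classifier.public_send(match[1], match[2], *args, &block)
  end

  # Keep respond_to? consistent with method_missing.
  def respond_to_missing?(name, include_private = false)
    PATTERN.match?(name.to_s) || super
  end
end
```

With this in place, `trainer.train_ham("a great message")` ends up as `classifier.train("ham", "a great message")`, and any other unknown method still raises NoMethodError.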
+1 on this question/suggestion. There may be reasons to have two different libraries, but IMVHO it would be better to have one slightly bigger library sharing APIs, services and keeping the useful differences.
> I've recently released a Ruby port "Bishop" of the "Reverend" bayesian
> classifier written in Python. Bishop-0.3.0 is available as a Gem and
> from RubyForge ...
> Regards,
> Matt
Hello Matt,
Thank you for this useful library. I am trying to use it to analyse the draft text of the European constitution (Is it social? Liberal? Respectful of human rights?). I am doing this for myself, just out of curiosity; there is no responsibility or liability involved in the usage of the classifier or in the result. I'd like to know how training behaves when two different sets of words are submitted in two successive "train" method invocations for a given category. Does the second invocation reset the training, or does it accumulate the "experience" progressively?
> +1 on this question/suggestion.
> There may be reasons to have two different libraries, but IMVHO it
> would be better to have one slightly bigger library sharing APIs,
> services and keeping the useful differences.
I thought it was about time I responded to this.
If I had known Lucas was working on his classifier library before I did the port of Reverend I probably wouldn't have bothered. However I have done it and am using it in another project of my own and have had some ideas about possible future developments.
One example is to build a version which runs directly from a SQL database (possibly using ActiveRecord). I'm also interested in new algorithms and in possible improvements to support classifying RSS items within a tag space.
None of which precludes rolling Bishop and Classifier into one project.
However right now I'd like to keep control of Bishop and not be constrained from making possibly incompatible changes to the API or implementation. Similarly Lucas may have his own plans for how he wants to see Classifier develop.
I don't see the harm in having two projects. What I've suggested to Lucas is that we compare notes periodically and see whether it makes sense to merge them. I guess that if a lot of users of the libraries made a fuss, that would also affect my opinion.
> I am trying to use it to analyse the draft text of the European
> constitution (Is it social? Liberal? Respectful of human rights?)
> [..snip..]
> I'd like to know how training behaves when two different sets of words
> are submitted in two successive "train" method invocations for a given
> category. Does the second invocation reset the training, or does it
> accumulate the "experience" progressively?
You're right when you say it accumulates. Further training supplies more evidence to the classifier about which words are associated with which categories. It uses this evidence to work out conditional probabilities, which are then combined to make a guess about the appropriate category for an item.
There is an #untrain method if you want to remove previously trained information.
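The accumulate-then-untrain behaviour can be illustrated with a toy word counter. This is a stand-in written for this example, not Bishop's implementation: the class `ToyPools`, its tokenizer, and its `count` method are all assumptions, and it only tracks raw counts without computing any probabilities.

```ruby
# Toy stand-in (not Bishop): per-pool word counts that grow with each
# train call and shrink with untrain.
class ToyPools
  def initialize
    @pools = Hash.new { |hash, pool| hash[pool] = Hash.new(0) }
  end

  # Successive calls for the same pool add to the existing counts.
  def train(pool, text)
    tokens(text).each { |word| @pools[pool][word] += 1 }
  end

  # Removes the evidence a matching train call would have added.
  def untrain(pool, text)
    tokens(text).each do |word|
      next unless @pools[pool].key?(word)

      @pools[pool][word] -= 1
      @pools[pool].delete(word) if @pools[pool][word] <= 0
    end
  end

  def count(pool, word)
    @pools[pool][word]
  end

  private

  # Naive tokenizer: lowercase runs of word characters.
  def tokens(text)
    text.downcase.scan(/\w+/)
  end
end
```

Training the "social" pool first with "workers rights" and then with "rights and welfare" leaves a count of 2 for "rights"; untraining the second text brings it back to 1 and removes "welfare" entirely.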
<snip all> thanks for taking the time to answer. I can understand your reasons, and I'm glad to know there is at least some contact between different hackers on similar projects. thanks both :)
> On 4/18/05, Jaypee <rf.ooda...@sd.eepyaj> wrote:
>>Thank you for this useful library.
> You're welcome.
>>I am trying to use it to analyse the draft text of the European
>>constitution (Is it social? Liberal? Respectful of human rights?)
>>[..snip..]
>>I'd like to know how training behaves when two different sets of words
>>are submitted in two successive "train" method invocations for a given
>>category. Does the second invocation reset the training, or does it
>>accumulate the "experience" progressively?
> You're right when you say it accumulates. Further training supplies
> more evidence to the classifier about which words are associated with
> which categories. It uses this evidence to work out conditional
> probabilities, which are then combined to make a guess about the
> appropriate category for an item.
> There is an #untrain method if you want to remove previously trained
> information.
The Subversion trunk of projects/classifier (see http://rufy.com/svn/classifier/trunk) has the untrain method in it. This will be released soon as Classifier 1.2.