Using MutualInformationFeatureSelectionExtractor


Gully Burns

Jul 22, 2014, 4:06:51 PM
to cleartk-d...@googlegroups.com
Hey everyone, 

I'm running a mutual information analysis across a corpus, but the call to extractor.train(instances) is taking a very long time. After digging around in the code, I noticed that the system is very slow when sorting the list of mutual information values. Is this because the extractor computes the mutual information score inside each comparison step?

I have a corpus of about 14,000 full-text documents, and this is killing my performance.

Am I right about this? Is there some way of caching the output effectively?
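
To make the question concrete, here is a minimal sketch of the pattern I suspect and of the kind of caching I have in mind. It is not the ClearTK code: the class CachedScoreSort, its two methods, and the scorer argument are all hypothetical names made up for illustration, and the scorer just stands in for whatever computes a mutual information value for one feature. The first method calls the scorer inside the comparator, so it runs on every comparison, roughly n log n times per sort. The second scores each distinct feature once and lets the sort read from a map.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical illustration only; none of these names come from ClearTK.
public final class CachedScoreSort {

    private CachedScoreSort() {}

    // Suspected slow pattern: the scorer is invoked twice per comparison,
    // so an expensive score is recomputed many times during one sort.
    public static List<String> sortRecomputing(List<String> features,
                                               ToDoubleFunction<String> scorer) {
        List<String> sorted = new ArrayList<>(features);
        sorted.sort((a, b) ->
                Double.compare(scorer.applyAsDouble(b), scorer.applyAsDouble(a)));
        return sorted;
    }

    // Cached pattern: score each distinct feature exactly once up front,
    // then sort in descending order using only map lookups.
    public static List<String> sortCached(List<String> features,
                                          ToDoubleFunction<String> scorer) {
        Map<String, Double> cache = new HashMap<>();
        for (String feature : features) {
            cache.computeIfAbsent(feature, f -> scorer.applyAsDouble(f));
        }
        List<String> sorted = new ArrayList<>(features);
        sorted.sort(Comparator.comparingDouble((String f) -> cache.get(f)).reversed());
        return sorted;
    }
}
```

Both methods should return the same descending order; the only difference is how often the scorer runs. If the real extractor does something like the first method, moving the score computation out of the comparator would be the fix I would try first.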

Gully   

Philip Ogren

Jul 22, 2014, 5:07:30 PM
to cleartk-d...@googlegroups.com
It has been a few months since I looked at this code, but this sounds right to me. I remember that there is an unacceptable performance bug in this code that I didn't make time to track down (I didn't write the code, I just added some basic unit tests to it!). I like the idea of having trainable extractors, and I think our first pass at the abstraction/API is good (your feedback is welcome, however), but I think the actual implementation of the first set of trainable feature extractors is pretty immature. Thank you for bravely giving them a try! We would be happy to incorporate any fixes you come up with!
