Hey everyone,
I'm running a Mutual Information analysis across a corpus, but the call to extractor.train(instances) is taking a very long time. After digging around in the code, I noticed that most of the time seems to go into sorting the list of mutual information values. Is this because the extractor recomputes the mutual information score inside each comparison step of the sort?
I have a corpus of about 14,000 full-text documents, and this is killing my performance.
Am I right about this? If so, is there some way to cache the scores so they are only computed once before the sort?
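To make the question concrete, here is a minimal Python sketch of the kind of caching I have in mind. It is not the extractor's actual code: expensive_score is a hypothetical stand-in for the mutual information computation, and the call counter only exists to show how often it runs. The first sort scores both items on every comparison, while the second scores each distinct term once and then sorts on the stored values.

```python
from functools import cmp_to_key

# Counts how many times the (hypothetical) score function is invoked.
calls = {"n": 0}

def expensive_score(term):
    """Hypothetical stand-in for a costly mutual information computation."""
    calls["n"] += 1
    return len(term) * 0.1

def compare_recomputing(a, b):
    # Scores both terms on every comparison, so the score function
    # runs roughly 2 * (number of comparisons) times during a sort.
    sa, sb = expensive_score(a), expensive_score(b)
    return (sa > sb) - (sa < sb)

def sort_recomputing(terms):
    """Sort by descending score, recomputing scores inside the comparator."""
    return sorted(terms, key=cmp_to_key(compare_recomputing), reverse=True)

def sort_cached(terms):
    """Sort by descending score, computing each distinct term's score once."""
    scores = {t: expensive_score(t) for t in set(terms)}
    return sorted(terms, key=scores.__getitem__, reverse=True)
```

If the slowdown really does come from rescoring inside the comparator, I would expect something like the second version to cut the number of score computations from the order of n log n down to n.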
Gully