Realistic Corpus size

David Webb

Jul 19, 2011, 4:33:32 PM

to s-spac...@googlegroups.com

I have about 2.5 million documents that I can analyze with LSA. I currently have a test sspace that I generated from 10K of those documents.

Is there a magic number of documents to analyze, beyond which the returns diminish? By return, I mean the time to generate, load, or query the sspace file.

Assume that all 2.5 million are similar documents (resumes) and that whatever sample size I choose will provide a good representative sample of the entire set.

Thanks.
