Realistic Corpus size

David Webb

Jul 19, 2011, 4:33:32 PM
to s-spac...@googlegroups.com
I have about 2.5 million documents that I can analyze with LSA. I currently have a test sspace that I generated from 10K of those documents.

Is there a magic number of documents to analyze, beyond which the returns diminish? By returns, I mean what I gain from adding more documents relative to the time it takes to generate, load, or query the sspace file.

Assume that all 2.5 million are similar documents (resumes) and that whatever sample size I choose should provide a representative sample of the entire set.
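
To make the question concrete, here is roughly how I would pick the sample sizes to try and draw each sample. This is only a minimal sketch: `sample_documents` and `doubling_schedule` are hypothetical helper names of mine, not part of the S-Space package, and the doubling step is just an assumption about what sizes are worth testing. The sampling itself is plain reservoir sampling, so each document has the same chance of being picked in a single pass.

```python
import random


def sample_documents(doc_ids, k, seed=0):
    """Draw a uniform random sample of k items from an iterable in one pass
    (reservoir sampling, Algorithm R). Hypothetical helper, not an S-Space API."""
    rng = random.Random(seed)
    reservoir = []
    for i, doc_id in enumerate(doc_ids):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(doc_id)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = doc_id
    return reservoir


def doubling_schedule(start, total):
    """List candidate corpus sizes: start, 2*start, 4*start, ..., then total."""
    sizes = []
    n = start
    while n < total:
        sizes.append(n)
        n *= 2
    sizes.append(total)
    return sizes


if __name__ == "__main__":
    # Assumed numbers from my setup: 10K test corpus, about 2.5 million documents.
    for size in doubling_schedule(10_000, 2_500_000):
        print(size)
```

The idea would be to build one sspace per size, time the generate, load, and query steps for each, and stop growing the corpus once the results stop changing noticeably.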

Thanks.