Hi Jin,
Then I'm wondering what your use case is, if Common Crawl is too small. It has been used successfully in many NLP tasks: training word embeddings with GloVe (https://github.com/stanfordnlp/GloVe), mining parallel text for machine translation (Smith et al., 2013), and building language models (Buck et al., 2014). You mentioned a "language reference tool" - in that case you should also look at the Web as Corpus (WaC) initiative and similar projects; see e.g. Schäfer & Bildhauer (2013).
Best,
Ivan
References:
Smith, J. R., Saint-Amand, H., Plamada, M., Koehn, P., Callison-Burch, C., & Lopez, A. (2013). Dirt Cheap Web-Scale Parallel Text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1374–1383). Sofia, Bulgaria: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/P13-1135
Buck, C., Heafield, K., & van Ooyen, B. (2014). N-gram Counts and Language Models from the Common Crawl. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, … S. Piperidis (Eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14) (pp. 3579–3584). Reykjavik, Iceland: European Language Resources Association (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/1097_Paper.pdf
Schäfer, R., & Bildhauer, F. (2013). Web Corpus Construction. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. doi:10.2200/S00508ED1V01Y201305HLT022