training - input documents format


Jul 8, 2015, 12:18:22 PM7/8/15

I am learning to use word2vec to train word vectors. I have noticed that the demo takes a single raw text document, which seems to be the concatenation of lots of individual documents.

So suppose my corpus has 1 million documents, does it mean that I need to extract raw text content from these documents and concatenate them into a single one to pass to word2vec?

Also, does it matter whether the text content is normalized? For example, I noticed that one example input file has no punctuation and all words are lowercase, but another input file seems to contain unnormalized text.

Many thanks!

ziqi zhang

Jul 9, 2015, 10:59:27 AM7/9/15
Hi all

I understand this may be a trivial question, but I have looked around and could not find an explicit answer, only some hints from checking the list of training data. I cannot understand C, so I also cannot read it from the code. So I would very much appreciate some suggestions on this question!

So far, here are my findings, which confused me:
- the "text8" data downloaded by "": a single file that seems to concatenate multiple documents; no punctuation, all lowercase
- the data I downloaded from the "One Billion Word Language Modeling Benchmark" (almost 1B words, already pre-processed text): multiple files, each containing punctuation and capitalization

Which of these training data formats is right, or are both OK?

many thanks!

Tomas Mikolov

Jul 9, 2015, 11:14:48 AM7/9/15
There is no universal answer. It all depends on what you plan to use the vectors for. In my experience, it is usually good to disconnect (or remove) punctuation from words, and sometimes also convert all characters to lowercase. One can also replace all numbers (possibly greater than some constant) with some single token such as <NUM>.

All these pre-processing steps aim to reduce the vocabulary size without removing any important content (which in some cases may not hold when you lowercase certain words; e.g., 'Bush' is different from 'bush', while 'Another' usually has the same sense as 'another'). The smaller the vocabulary, the lower the memory complexity, and the more robustly the parameters for the words are estimated. You also have to pre-process the test data in the same way.
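The steps described above (lowercasing, disconnecting punctuation, replacing numbers with a catch-all token) can be sketched roughly as follows. This is a minimal illustration, not the actual normalization script used for the benchmark data; the exact punctuation set and number pattern are assumptions you would tune for your own corpus.

```python
import re

NUM_TOKEN = "<NUM>"  # catch-all token for numbers, as suggested above

def normalize(line: str) -> str:
    """Lowercase, map numbers to <NUM>, and disconnect punctuation from words."""
    line = line.lower()
    # Replace standalone integers and decimals first, so the decimal
    # point is not split off by the punctuation step below.
    line = re.sub(r"\b\d+(\.\d+)?\b", NUM_TOKEN, line)
    # Put spaces around punctuation so each mark becomes its own token.
    line = re.sub(r"([.,!?;:()\"'])", r" \1 ", line)
    return " ".join(line.split())

print(normalize("Bush spoke to 150 reporters."))
# → "bush spoke to <NUM> reporters ."
```

Note the trade-off Tomas mentions: this maps 'Bush' and 'bush' to the same token, which shrinks the vocabulary but loses the distinction between the name and the plant.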

To see how text from the 1B Word Benchmark can be normalized, see:

In short, you will understand all this much better if you run experiments.

ziqi zhang

Jul 9, 2015, 11:42:53 AM7/9/15
Thanks Tomas for a very nice example that answers all my questions!


Jul 10, 2015, 3:15:10 AM7/10/15
I find deciding what to do with numbers the most challenging. Removing them entirely gives good results, but there is a sense that good information has been left on the table. Leaving the numbers in sometimes degrades the results and sometimes improves them. Using <NUM> as a catch-all sounds interesting, but isn't that the same as leaving them out? Or is the idea that it falls somewhere between all-out and all-in?

ziqi zhang

Jul 16, 2015, 4:44:17 AM7/16/15
I think it depends on your needs and your data.

If your data are number-centric and the numbers really carry a lot of the information, then you may want to keep them, or even normalise them into discrete ranges (again, depending on your needs). But if you want to focus on words and numbers are not prevalent, I think the idea is that with a really huge training dataset (e.g., billions of words), the data will just 'explain itself' and you need not worry about the noise or the less important information that is lost. In other words, how you normalise the numbers may not matter much to your results.

But if you train on a very small corpus, e.g., a few thousand words, then you may notice the difference.
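Normalising numbers into discrete ranges, as mentioned above, could be sketched like this. The bucket boundaries and token names here are purely hypothetical; they would need to be chosen to fit your own data.

```python
import re

# Hypothetical bucket boundaries and tokens; tune these for your corpus.
BUCKETS = [(10, "<NUM_SMALL>"), (1000, "<NUM_MED>")]
OVERFLOW = "<NUM_LARGE>"

def bucket_token(value: float) -> str:
    """Pick the range token for a numeric value."""
    for upper, token in BUCKETS:
        if value < upper:
            return token
    return OVERFLOW

def bucket_numbers(line: str) -> str:
    """Replace each standalone number with a coarse range token."""
    return re.sub(r"\b\d+(\.\d+)?\b",
                  lambda m: bucket_token(float(m.group())), line)

print(bucket_numbers("sold 7 units for 2500 dollars"))
# → "sold <NUM_SMALL> units for <NUM_LARGE> dollars"
```

Compared with a single <NUM> token, this keeps some of the magnitude information while still collapsing the vocabulary, which is the "somewhere in between" option asked about earlier in the thread.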