I am learning to use word2vec to train word vectors. I have noticed that the demo takes a single raw text file, which seems to be the concatenation of many individual documents.
So if my corpus has 1 million documents, does that mean I need to extract the raw text from each of them and concatenate everything into a single file to pass to word2vec?
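To be concrete, here is roughly what I imagine doing, assuming each document is its own `.txt` file under a `docs/` directory (the file names here are just made up for illustration):

```shell
# Sample setup: two small documents standing in for the 1 million.
mkdir -p docs
printf 'first document\n' > docs/a.txt
printf 'second document\n' > docs/b.txt

# Concatenate every document into one big training file,
# since word2vec appears to expect a single input stream of tokens.
cat docs/*.txt > corpus.txt
```

Is this single `corpus.txt` the right shape of input, or does word2vec care about document boundaries?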
Also, does it matter whether the text is normalized? For example, I noticed that the example input file to demo-word.sh has no punctuation and all words are lowercase, but the input file to demo-phrases.sh seems to contain unnormalized text.
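By normalization I mean something like the following pass (the exact rules, lowercasing and stripping everything except letters and whitespace, are just my guess at what was done to the demo-word.sh input):

```shell
# Sample raw text with mixed case and punctuation.
printf 'Hello, World! This is a Test.\n' > raw.txt

# Lowercase everything, then delete every character that is not
# a lowercase letter, a space, or a newline.
tr '[:upper:]' '[:lower:]' < raw.txt | tr -cd 'a-z \n' > normalized.txt

cat normalized.txt
```

Do I need to apply a pass like this to my whole corpus before training, or does word2vec tolerate raw text?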