1. What are the best practices in data preparation for doc2vec?
2. I know it depends on the data, but in general, which is better: keeping punctuation or removing it?
3. How do stop words impact a doc2vec model? Unlike LDA, will a doc2vec model benefit from stop words?
4. Does stemming help improve the model?
I have trained a doc2vec model on a Wikipedia dump and it works well. When I search for 'Artificial intelligence', it returns words related to artificial intelligence. But when I search for 'artificial intelligence', it fails, because the vocabulary contains 'Artificial intelligence' but not 'artificial intelligence'. Is there a way to convert the vocabulary of the doc2vec model I built to lowercase and replace each '-' between words with a space? This would help because, whenever a user enters a query, I would first lowercase the string and replace each '-' with a space, then look it up in the vocabulary to get relevant keywords.
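To make the intended lookup concrete, here is a minimal sketch in plain Python. It does not modify the trained model; instead it builds a side table from normalized strings back to the original vocabulary keys. The names `normalize`, `build_lookup`, and `resolve` are hypothetical helpers, and the `vocab_keys` list is made-up sample data. With gensim 4.x the real keys could presumably be taken from `model.wv.index_to_key`, but that is an assumption about the setup, not something shown here.

```python
import re


def normalize(text):
    """Lowercase, replace hyphens with spaces, and collapse repeated whitespace."""
    text = text.lower().replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()


def build_lookup(vocab_keys):
    """Map each normalized form to the original vocab keys that produce it."""
    lookup = {}
    for key in vocab_keys:
        # Several original keys can collapse to the same normalized form,
        # so keep all of them rather than overwriting.
        lookup.setdefault(normalize(key), []).append(key)
    return lookup


# Hypothetical sample vocabulary (assumption, stands in for the model's keys).
vocab_keys = ["Artificial intelligence", "Machine-learning", "machine learning"]
lookup = build_lookup(vocab_keys)


def resolve(query):
    """Return the original vocab keys matching a user query, or an empty list."""
    return lookup.get(normalize(query), [])


print(resolve("artificial intelligence"))
print(resolve("Machine Learning"))
```

The keys returned by `resolve` are the original, unmodified vocabulary entries, so they can be passed to the model as usual. When several entries collapse to the same normalized form, the caller has to decide whether to query all of them or pick one.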