Good day everybody. I've been searching the forum for an answer to my question, but I'm still unsure about two things when using Doc2Vec with paragraph IDs.
If I understand it correctly, Word2Vec makes use of the context of words, and it needs the words to be repeated in many places to build up "a context". In that way it 'finds' the semantics of words and places Queen and King close together, for example (just because they tend to appear many times in sentences such as "the King had a crown" or "the Queen was crowned").
What I can't understand is why the tags for the documents can be unique, so that the paragraph IDs would be Doc0001, Doc0002, ..., Doc9999, with each tag appearing only once and never repeated.
To explain my question better: I am using Doc2Vec in three ways, but I'm getting fairly bad results and I can't see what the normal way of doing it would be.
In the first case, each of my documents has a unique ID. The documents are genomes, treated as if each were one really long sentence with few repetitions, often containing sections (called operons) that are repeated between documents. D2V does not place the ones I expect to be fairly similar very close together.
In the second case, I have those genomes broken into parts (since each one is like a long sentence, the breaks may fall at random positions). For D2V, every section of a genome then has the same tag but quite different content. This one is really scattered in a t-SNE plot.
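A minimal sketch of what these first two setups look like, assuming a gensim-style document layout of (words, tags). The `TaggedDocument` tuple is defined locally with the same two fields as gensim's, and the gene names, the `chunk_genome` helper and the chunk size are made up purely for illustration:

```python
from collections import namedtuple

# Same (words, tags) layout as gensim's TaggedDocument, defined locally
# so that this sketch has no dependencies.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def chunk_genome(genes, genome_id, chunk_size):
    """Split one long token list into consecutive chunks.

    Every chunk carries the same tag (hypothetical helper, fixed-size
    chunks rather than biologically meaningful breaks).
    """
    return [
        TaggedDocument(words=genes[i:i + chunk_size], tags=[genome_id])
        for i in range(0, len(genes), chunk_size)
    ]


# Toy data: placeholder gene names, not real annotations.
genome = ["geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "geneG"]

# Case 1: the whole genome is one document with a unique tag.
whole = [TaggedDocument(words=genome, tags=["Doc0001"])]

# Case 2: the genome is split into chunks that all share that tag.
chunks = chunk_genome(genome, "Doc0001", chunk_size=3)

for doc in chunks:
    print(doc.tags, doc.words)
```

In case 1 the tag Doc0001 is trained against a single very long token list; in case 2 the same tag is trained against several shorter lists with different content.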
The third case, which I want to try but am not sure makes sense, would be a kind of sentiment analysis / collaborative filtering, where instead of the ID I use the characteristics as tags, for example "Photosynthetic".
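A rough sketch of this third idea, again assuming a gensim-style (words, tags) layout. The trait names other than "Photosynthetic", the toy genomes and the `tag_with_traits` helper are hypothetical, and keeping the unique ID next to the traits is only one possible variant:

```python
from collections import Counter, namedtuple

# Same (words, tags) layout as gensim's TaggedDocument, defined locally.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Toy data: placeholder gene names and made-up trait annotations.
genomes = {
    "Doc0001": ["geneA", "geneB", "geneC"],
    "Doc0002": ["geneB", "geneC", "geneD"],
    "Doc0003": ["geneE", "geneF"],
}
traits = {
    "Doc0001": ["Photosynthetic"],
    "Doc0002": ["Photosynthetic", "Thermophilic"],
    "Doc0003": ["Thermophilic"],
}


def tag_with_traits(genomes, traits, keep_id=False):
    """Tag each genome with its characteristics, optionally plus its unique ID."""
    docs = []
    for genome_id, genes in genomes.items():
        tags = list(traits.get(genome_id, []))
        if keep_id:
            tags.append(genome_id)
        docs.append(TaggedDocument(words=genes, tags=tags))
    return docs


docs = tag_with_traits(genomes, traits, keep_id=True)

# How many documents contribute to each tag: trait tags repeat, IDs do not.
tag_counts = Counter(tag for doc in docs for tag in doc.tags)
print(tag_counts)
```

Here a trait tag such as "Photosynthetic" is shared by several genomes, while each ID tag still occurs exactly once.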
Can anyone give me a clue as to whether the unique ID is intended for putting together thousands of texts (so thousands of unique IDs)? Or, as would sort of make more sense to me, must the tags be repeated?
On the other hand, does anyone know what can happen in cases where most of each sentence is in more or less random order, except for certain regions of it?
Thanks a lot in advance for your help, and sorry if I did not explain myself clearly.
Cheers. Odin