Newbie trying to use D2V for genomes

Odin Morón-García

Jun 21, 2022, 8:34:11 AM
to Gensim
Good day everybody. I've been looking through the forum for a solution to my question, but I remain unsure about two things regarding using Doc2Vec with paragraph IDs.

If I understood it well, Word2Vec makes good use of the context of words, and it needs the words repeated in many places to create "a context"; in that way, it 'finds' the semantics of words and places Queen and King together, for example (just because they tend to appear in sentences like "the King had a crown", "the Queen was crowned" ... many times).
What I can't understand is why the tags for the documents can be unique, so a paragraph ID would be Doc0001, Doc0002, ..., Doc9999.
To express my question better: I am using Doc2Vec in three ways but getting somewhat bad results, and I can't see what the normal way of doing it would be.

In the first case, my documents (genomes, treated as if each were one really long sentence with few repetitions, though with sections, called operons, that are often repeated between documents) each have a unique ID, and D2V places those that I expect to be somewhat similar not so close together.
In the second case, I have those genomes broken into parts (as if a long sentence were broken at random points); then for D2V each section has the same tag but quite different content. This one comes out really scattered in a t-SNE plot.
The third case, which I want to try but am not sure makes sense, would be a kind of sentiment analysis / collaborative filtering where, rather than the ID, I use the characteristics as tags, for example "Photosynthetic".

Can anyone give me a clue whether the unique ID is intended for putting together thousands of texts (so thousands of unique IDs)? It sort of makes sense to me that the tags should be repeated.

On the other hand, does anyone know what can happen in cases where most of a sentence is in a sort of random order, except for certain regions of it?

Thanks a lot in advance for your help, and sorry if I did not explain myself correctly.
Cheers, Odin

Gordon Mohr

Jun 21, 2022, 1:29:33 PM
to Gensim

While I've seen writeups, & questions here, suggesting people have used word2vec/doc2vec/etc algorithms successfully in genomic applications, keep in mind that their main track record, & most writeups/rules-of-thumb, come from natural-language data/applications. For any domain where the breadth, or relative frequencies, or reliable cooccurrence patterns, are very different, these algorithms might have a harder time, or require far more experimental tuning (in data preprocessing & metaparameters), to deliver good results. 

Answers/comments interleaved below...

On Tuesday, June 21, 2022 at 5:34:11 AM UTC-7 omg...@gmail.com wrote:
Good day everybody. I've been looking through the forum for a solution to my question, but I remain unsure about two things regarding using Doc2Vec with paragraph IDs.

If I understood it well, Word2Vec makes good use of the context of words, and it needs the words repeated in many places to create "a context"; in that way, it 'finds' the semantics of words and places Queen and King together, for example (just because they tend to appear in sentences like "the King had a crown", "the Queen was crowned" ... many times).
What I can't understand is why the tags for the documents can be unique, so a paragraph ID would be Doc0001, Doc0002, ..., Doc9999.
To express my question better: I am using Doc2Vec in three ways but getting somewhat bad results, and I can't see what the normal way of doing it would be.

The point of the `tags` is to provide the keys for looking-up learned vectors, where those vectors describe a whole 'document'. In the most straightforward, & original, application of the 'Paragraph Vector' algorithm (aka 'Doc2Vec' in Gensim), every document gets just a single tag that is its unique ID. It's just an opaque look-up key: if the 1st document was given the single tag `Doc0001`, then after training, you can look-up the vector that was learned, as best-predicting that document's words in the trained model, via that key `'Doc0001'`. That's the only significance of the tags. 
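
For concreteness, here's a minimal sketch of that baseline usage (the tiny corpus & parameter values are hypothetical placeholders, assuming the Gensim 4.x API):

```
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical toy corpus: each document is a token list plus one unique ID tag.
corpus = [
    TaggedDocument(words=['tokA', 'tokB', 'tokC'], tags=['Doc0001']),
    TaggedDocument(words=['tokB', 'tokC', 'tokD'], tags=['Doc0002']),
    # ... thousands more documents ...
]

model = Doc2Vec(corpus, vector_size=100, window=5, min_count=2, epochs=20)

# After training, a tag is just the look-up key for that document's learned vector:
vec = model.dv['Doc0001']
print(model.dv.most_similar('Doc0001'))  # other tags whose vectors landed nearest
```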

(Gensim `Doc2Vec` also supports supplying multiple tags per text, which results in training that's roughly similar to the effect if the document were repeated multiple times with alternate tags. And, you can choose to repeat tags across multiple documents, which has an effect roughly similar to if *all* those texts were in one virtual document with the single tag... even though those text-ranges may be spread throughout the corpus. These styles of use have far less published experience behind them, so anything tried should be considered experimental, being careful to verify that the results still make sense.)
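
In code, continuing the hypothetical toy corpus above, those two variations would look like:

```
# Multiple tags on one text: the text's words help train *both* vectors.
TaggedDocument(words=['tokA', 'tokE'], tags=['Doc0003', 'Photosynthetic'])

# One tag repeated across texts: both texts pull on the single 'Doc0004' vector,
# roughly as if they were one virtual document.
TaggedDocument(words=['tokA', 'tokB'], tags=['Doc0004'])
TaggedDocument(words=['tokE', 'tokF'], tags=['Doc0004'])
```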

I don't know enough about your domain to know what your tags 'should' be... but roughly, they should correspond to the logically-related document-like groupings that your downstream steps want to reason about. 
 
In the first case, my documents (genomes, treated as if each were one really long sentence with few repetitions, though with sections, called operons, that are often repeated between documents) each have a unique ID, and D2V places those that I expect to be somewhat similar not so close together.

It's not clear from your wording here whether your documents are 'genomes' or 'operons', and whether any preprocessing has already turned repeating sections into pseudowords (unique tokens). It'd be easier to understand your goals & challenges with more specific examples, with relative lengths/counts. (How many 'words' in your training data? How long is each 'document'? How many 'documents'? Does every 'document' have a unique ID – and are these arbitrary, like serial numbers unique to your dataset, or meaningful because the 'documents' are already well-named distinct things?)

Also, how does one form an expectation that things should be "sort of similar", and what are the criteria for deciding the results are "not so close"?

The quality of these models will depend a lot on the quality/variety/quantity of training data, whether there are real cooccurrence patterns within the algorithm's ability to notice, and whether the model's parameters are well-tuned – especially things like `window` and `vector_size` – to be able to capture whatever patterns are there.

If you have (or can create) some 'ground truth' – even fuzzy – of what results *should* be like, that helps a lot to evaluate model choices. For example, if you can create a non-trivial list of pairs of tags that "should" be closer than other pairs, in whatever sense you're hoping the model learns, then you can make an automated model-check. That'd allow you to score a model on how many, of those reference pairs, it successfully puts closer than other pairs.
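
As a sketch of that sort of automated check (where `close_pairs` is whatever hypothetical ground-truth list of should-be-close tag pairs you can assemble):

```
import random

def pair_score(model, close_pairs, trials_per_pair=10):
    """Fraction of trials where a should-be-close pair beats a random pairing."""
    all_tags = model.dv.index_to_key
    wins = total = 0
    for tag_a, tag_b in close_pairs:
        for _ in range(trials_per_pair):
            tag_c = random.choice(all_tags)
            if tag_c in (tag_a, tag_b):
                continue
            wins += model.dv.similarity(tag_a, tag_b) > model.dv.similarity(tag_a, tag_c)
            total += 1
    return wins / total

# e.g.: pair_score(model, [('Doc0001', 'Doc0042'), ('Doc0007', 'Doc0013')])
```

Higher scores on the same reference pairs then give a rough, repeatable way to rank different parameter choices against each other.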

(In some of the early published Doc2Vec work, preexisting human-curated categorical labels were used for this: the evaluation conjecture was "2 docs in the same category should be given vectors that are closer to each other, than to a 3rd randomly-chosen document not in the same category". Note that's not a perfect evaluation! Categories might have good reason to group two docs that are very different in most respects! A random other doc might in fact be very close, in many other language-influenced ways, to a doc not in its same formal category, whether for good reasons or through simple errors/oversights. So getting "100%" on such an evaluation is not even possible or desirable. But still, by providing a bulk, automatable, generally-in-the-right-direction indication of where slightly "better" model possibilities lived, that was enough to guide model parameter choices, & to say some fuzzy things about relative model quality under different assumptions.)

In the second case, I have those genomes broken into parts (as if a long sentence were broken at random points); then for D2V each section has the same tag but quite different content. This one comes out really scattered in a t-SNE plot.

I don't understand the motivation here: why would the same tag be repeated for "quite different content"? What's the tag, in that case, supposed to represent/capture?

Also, a dimensionality-reduction & the aesthetics of a subsequent plot-via-t-SNE won't necessarily be a good indicator of model quality. These dense, high-dimensional models specifically shine where our 2D/3D imaginations may be very, very challenged.

The third case, which I want to try but am not sure makes sense, would be a kind of sentiment analysis / collaborative filtering where, rather than the ID, I use the characteristics as tags, for example "Photosynthetic".

Some projects have certainly injected known aspects of the 'documents' into training, as either synthetic pseudowords in the texts, or multiple overlapping doc-tags. Whether this helps for any downstream use will be a very data-, choices-, & goals-specific matter. It's also reasonable in many cases to leave such data *out* of the unsupervised dense-vector-modelling, but then use the representational vectors in a downstream classifier that's trained with those known characteristics.
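
For instance, a rough sketch of that leave-it-out-then-classify approach, assuming scikit-learn & a hypothetical `labels` dict of known characteristics:

```
from sklearn.linear_model import LogisticRegression

# 'labels' is hypothetical, e.g. {'Doc0001': 'Photosynthetic', 'Doc0002': 'Other', ...}
tags = [t for t in model.dv.index_to_key if t in labels]
X = [model.dv[t] for t in tags]  # unsupervised doc-vectors as features
y = [labels[t] for t in tags]    # known characteristics as targets

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Classify a held-out doc from its inferred vector ('new_tokens' is hypothetical):
print(clf.predict([model.infer_vector(new_tokens)]))
```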

Can anyone give me a clue whether the unique ID is intended for putting together thousands of texts (so thousands of unique IDs)? It sort of makes sense to me that the tags should be repeated.

A unique-and-arbitrary ID, for one-to-one lookup of original documents, is definitely the original/baseline case – and I'd try to get some things working, and some sense of the various tradeoffs, in that simple model before trying other more-speculative approaches (like many-tags-per-document, or tags that are really descriptive labels which repeat over many documents). 

On the other hand, does anyone know what can happen in cases where most of a sentence is in a sort of random order, except for certain regions of it?

Because the training of these models is a giant tug-of-war between contrasting examples, wherein over many training passes only the reliable patterns get reinforced into persistence, some amount of 'noise' generally won't ruin the results, but may slow training, requiring more passes to converge on the best-achievable representations.

In all cases, though, you'd want to ensure your training data has all sorts of contrasting examples equally spread throughout the corpus. (If your method of creating the corpus tended to, for example, put all examples with one interesting characteristic in a clump at the front, and some other characteristic at the back, the sort of 'best compromise' between those in the final weights will be harder to find, & the trick-of-the-ordering may leave the model at the end a little more pulled in one direction than with equally-interleaved examples throughout. In natural-language training sets, one 'shuffle' of the data at the beginning, if it had any risk of such word/word-sense imbalances, is usually enough to neutralize such concerns.)
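
That one-time shuffle needn't be anything fancy; for an in-memory list corpus like the earlier sketch, something like:

```
import random

random.shuffle(corpus)  # one up-front shuffle to interleave contrasting examples

model = Doc2Vec(vector_size=100, window=5, min_count=2, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
```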

Separately but perhaps relevant: note that the Gensim Word2Vec/Doc2Vec/FastText classes only train on texts of up to 10000 tokens. Tokens past the 10000 mark in any text are silently ignored. So if you have larger texts, you should pre-split them into no-larger-than-10000-token texts. (And, in Doc2Vec, simply giving each chunk of an oversized doc the same shared tag will have roughly the same effect on the final tag vector as if they were all trained in one text.)
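
A sketch of that pre-splitting, combined with the shared-tag trick (the `genomes` dict & the sizes here are hypothetical):

```
def split_oversized(tokens, doc_id, max_len=10000):
    """Break one long token list into <=10000-token texts sharing a single tag."""
    return [
        TaggedDocument(words=tokens[i:i + max_len], tags=[doc_id])
        for i in range(0, len(tokens), max_len)
    ]

# e.g. a hypothetical 45,000-token genome yields 5 texts, all tagged with its ID:
corpus = []
for doc_id, tokens in genomes.items():  # 'genomes': hypothetical {id: token-list} dict
    corpus.extend(split_oversized(tokens, doc_id))
```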

- Gordon

Odin Morón-García

Jun 22, 2022, 9:50:47 AM
to Gensim
Thanks a lot, Gordon! I'm afraid I was tired, and with the aim of not digging into the biological details I did not explain myself very well. I'll be back tonight with a better explanation of the few points you asked about. I reckon Doc2Vec may be a good fit for my problem (elsewhere there's a gene2Vec version, but it's not quite what I need).

I'll be back with more clever writing. Till then, thanks a lot!!
