XML Code similarity with Doc2Vec, weighted embeddings


Halit Vural

May 10, 2022, 1:31:30 PM
to Gensim
Hello everyone,

I am trying to train a Doc2Vec model to find similarities in my XML code. I have quite a tiny corpus: only hundreds of lines, no more.

I tried grid search with 

    params = {
            'dm' : [0,1],
            'vector_size' : [40, 50, 60, 70, 80],
            'window' : [8, 9, 10, 11, 12, 13, 14],
    }

I train my model using the code:

    model = Doc2Vec(tagged_data, dm=dm, vector_size=vector_size, window=window, min_count=1, workers=4, epochs = 150)

where the tagged_data includes my XML lines and their document tags.
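
For reference, tagged_data is built roughly like this (a minimal sketch, assuming whitespace tokenization and the line number as each document's tag):

    from gensim.models.doc2vec import TaggedDocument

    # xml_lines: a list holding the raw XML lines (a few hundred strings)
    # Each line becomes one "document": its whitespace tokens plus a unique tag.
    tagged_data = [
        TaggedDocument(words=line.split(), tags=[str(i)])
        for i, line in enumerate(xml_lines)
    ]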


I then tested the model by feeding it lines from the same corpus. I get low accuracy results, such as 0.92, with a number of wrong predictions: the model simply returns a different line. That suggests my model is underfitting.


I decided to use cross-validation (giving each line in turn as the test input) while grid-searching different sets of parameters, but that didn't work either.

My resulting parameters were:

{'dm': 0.7972972972972973, 'vector_size': 50.54054054054054, 'window': 11.77027027027027}

where I averaged the best parameters found for each line of the corpus.

I selected my parameters as:
{'dm': 1, 'vector_size': 51, 'window': 12}

So my new model was:
model = Doc2Vec(tagged_data, dm=1, vector_size=51, window=12, min_count=1, workers=4, epochs = 100)


Unfortunately, I could not get a good model out of this work. It does not find the exact line from the corpus: it gives me a different line with a higher similarity score, or at best the real line comes in second, very close behind.

I am planning to give weights to some parts of the code, because I believe some parts of the code are more important than the rest. But I don't know how to do it. Could you suggest a resource for learning how to weight embeddings, please?


Thank you in advance.

Halit

Gordon Mohr

May 11, 2022, 9:12:47 PM
to Gensim
On Tuesday, May 10, 2022 at 10:31:30 AM UTC-7 bosna...@gmail.com wrote:
Hello everyone,

I am trying to train a Doc2Vec model to find similarities in my XML code. I have quite a tiny corpus: only hundreds of lines, no more.

I tried grid search with 

    params = {
            'dm' : [0,1],
            'vector_size' : [40, 50, 60, 70, 80],
            'window' : [8, 9, 10, 11, 12, 13, 14],
    }

I train my model using the code:

    model = Doc2Vec(tagged_data, dm=dm, vector_size=vector_size, window=window, min_count=1, workers=4, epochs = 150)

where the tagged_data includes my XML lines and their document tags.

`Doc2Vec`, like other algorithms in the `Word2Vec` family, needs lots of training data to train & essentially "fill" the high-dimensional space usefully. I wouldn't expect "hundreds" of docs to be able to fill even the smallish-dimensional models you're trying. 

There's no theoretical basis for this, but as a very very rough rule-of-thumb, in `Word2Vec`, I wouldn't expect to train an N-dimensional dense embedding unless the vocabulary has at least N*N distinct words – and plenty of varied, subtly-contrasting usage examples of all those N*N words. (One or a few examples don't tend to give good vectors – there'll be just a few training-visits to those examples, and your model will reflect whatever those few, likely idiosyncratic, usages imply, rather than a more-generalizable vector that might be possible with a wider range of examples.)

So I wouldn't expect hundreds-of-docs to give a doc-vector space of even a meager 40 dimensions. 

Also: `min_count=1` is almost always a bad idea with these algorithms. Again, one (or even just a few) usage-examples don't have much chance of training a 'good', generalizable representation of a word. But given things like the Zipfian distribution of word frequencies in typical natural-language text, there are a *lot* of such singleton/few-count words. So you may find the majority of the model's state, & training time, struggling from the influence of insufficient examples, and this even interferes with the vectors for other more-common words. *Discarding* rare words, as with the default `min_count=5` (or even higher values when you have more data), usually gives better results. 

My general sense (from limited experiments) is that `min_count=1` can be especially damaging to `Doc2Vec` trainings, where you usually have only one text per document-tag (id). Those single-appearance words, in their single-appearance docs, either compete with the doc-vector as explainers of their contexts (in PV-DM mode, or PV-DBOW with skip-gram words), or have an oversized influence on the doc-vector – as the data suggests they are 100%-suggestive of the matching doc-vector (when real relationships are almost always more subtle than that).

Further, while it's certainly possible to apply these algorithms to things other than true natural-language text, like XML, their performance could be very sensitive to the specifics of what's in the XML, and your tokenization choices. For example, does the XML contain real natural prose in its elements – or just more rigorously-formatted data? How are elements, attributes, & element-bodies each tokenized (if at all)? How many tokens are in one of your typical docs? 

The more the XML is like natural-language, the more I'd expect `Word2Vec`-like techniques to do something useful. But if it's just dumps of database tables, with things like scalar data or selections from narrow controlled-vocabularies, it might not work well without lots of other tuning. 
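
For example, just as an illustrative sketch (not anything from your post), a tokenizer that splits out tag names, attribute names, attribute values, & element text as separate tokens gives the model far more repeated, contrastable tokens than treating each whole <element attr="..."> chunk as one opaque string:

    import re

    def tokenize_xml_line(line):
        """Very rough XML tokenizer: tag names, attribute names,
        attribute values, and element text become separate tokens."""
        tokens = []
        tokens += re.findall(r'</?([\w:.-]+)', line)                   # tag names
        for name, value in re.findall(r'([\w:.-]+)\s*=\s*"([^"]*)"', line):
            tokens.append(name)                                        # attribute name
            tokens += value.split()                                    # attribute value(s)
        for text in re.findall(r'>([^<]+)<', line):
            tokens += text.split()                                     # text between tags
        return tokens

    # tokenize_xml_line('<book id="42" lang="en">A short title</book>')
    # -> ['book', 'book', 'id', '42', 'lang', 'en', 'A', 'short', 'title']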

I then tested the model by feeding it lines from the same corpus. I get low accuracy results, such as 0.92, with a number of wrong predictions: the model simply returns a different line. That suggests my model is underfitting.

I'm not sure what kind of testing you're implying here. What's the 'accuracy' you're evaluating? 0.92 accuracy doesn't seem too bad! And, without seeing your method of evaluation, there could be other problems, & potential improvements, in your approach. (There's not enough info here to conclude "underfitting", especially given the likely data-insufficiency issue.) 
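
For example, one common sanity check (just a sketch, not necessarily the evaluation you ran) is to re-infer a vector for each training text & see how often the model's top `most_similar()` hit is that text's own tag:

    # Re-infer each training doc & check whether its own tag is the nearest neighbor.
    hits = 0
    for doc in tagged_data:
        inferred = model.infer_vector(doc.words)
        top_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
        if top_tag == doc.tags[0]:
            hits += 1
    print(f"self-retrieval accuracy: {hits / len(tagged_data):.3f}")

Even a perfect score on that check only means the model can re-find its own training docs, not that the 'similarity' it reports is the kind you actually need.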
 
I decided to use cross-validation (giving each line in turn as the test input) while grid-searching different sets of parameters, but that didn't work either.

My resulting parameters were:

{'dm': 0.7972972972972973, 'vector_size': 50.54054054054054, 'window': 11.77027027027027}

where I averaged the best parameters found for each line of the corpus.

I selected my parameters as:
{'dm': 1, 'vector_size': 51, 'window': 12}

So my new model was:
model = Doc2Vec(tagged_data, dm=1, vector_size=51, window=12, min_count=1, workers=4, epochs = 100)


It's not clear what cross-validation would mean in this sort of scenario. If you're holding back 1 line (document) from each training, then what's the test, post-training, using that 1 held-back line, that tells you if the model has succeeded or failed? 

Also, `dm` is essentially a boolean value, `vector_size` an integer (usually best left as a multiple of 4 for a slight performance benefit), and `window` an integer (that has no meaning in `dm=0` PV-DBOW mode unless separate option `dbow_words=1` is toggled on). So any automated optimization process that reports floating-point values for these parameters is probably nonsense. 
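
If you do want to search parameters, sweep discrete candidate values & score each fully-trained model with whatever evaluation actually matters to you, then keep the single best combination rather than averaging. A rough sketch, reusing the self-retrieval check above:

    from itertools import product
    from gensim.models.doc2vec import Doc2Vec

    def self_retrieval_accuracy(model, docs):
        hits = sum(
            model.dv.most_similar([model.infer_vector(d.words)], topn=1)[0][0] == d.tags[0]
            for d in docs
        )
        return hits / len(docs)

    results = []
    for dm, vector_size, window in product([0, 1], [40, 60, 80], [5, 10, 15]):
        m = Doc2Vec(tagged_data, dm=dm, vector_size=vector_size, window=window,
                    workers=4, epochs=150)  # default min_count=5 drops rare tokens
        results.append(((dm, vector_size, window), self_retrieval_accuracy(m, tagged_data)))

    best_params, best_score = max(results, key=lambda item: item[1])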

Unfortunately, I could not get a good model out of this work. It does not find the exact line from the corpus: it gives me a different line with a higher similarity score, or at best the real line comes in second, very close behind.

I am planning to give weights to some parts of the code, because I believe some parts of the code are more important than the rest. But I don't know how to do it. Could you suggest a resource for learning how to weight embeddings, please?

I doubt the issues here can be fixed with any sort of extra-reweighting. Rather, you likely need more data, & a better fit of the process/algorithm choice to whatever your true ultimate goal may be. What's the expected benefit for finding the "most similar" or "top N most similar" other XML documents from your set? What's the essence of the 'similarity' you'd like to detect? Have you tried any other techniques, or vectorizations of the raw XML?
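
For comparison, a much simpler baseline for "which stored XML line is most like this one" (not something you've described, just one alternative vectorization worth trying) is character n-gram TF-IDF with cosine similarity, which doesn't need a large corpus at all:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # xml_lines: the raw XML lines as strings
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
    tfidf_matrix = vectorizer.fit_transform(xml_lines)

    query = '<book id="42" lang="en">A short title</book>'  # any query line
    sims = cosine_similarity(vectorizer.transform([query]), tfidf_matrix)[0]
    print("most similar stored line:", sims.argmax(), xml_lines[sims.argmax()])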

 - Gordon