How to use Mallet Embeddings?

Alain Désilets

Mar 8, 2018, 6:05:42 AM

to DKPro Similarity Users

I have used ESA embeddings for evaluating similarity, but it's pretty slow.

I am now trying to generate domain specific embeddings and have been able to do so using this sample code:

https://github.com/dkpro/dkpro-core-examples/blob/master/wordembeddings-asl/src/main/java/de/tudarmstadt/ukp/dkpro/core/examples/embeddings/EmbeddingsPipeline.java

Now I want to use the resulting word embeddings file to evaluate similarity between documents.

Is there an example somewhere that shows how to do this? I have searched the mailing lists and the GitHub repos and can't find anything.

Thx.

Alain

Alain Désilets

Mar 8, 2018, 6:37:51 AM

to DKPro Similarity Users

Never mind. Looking at the example code for ESA, I now realize that it generalizes to any vector space model, whether a Wikipedia-based ESA model or, say, a Mallet word embeddings model. You just have to pass the path of the file that holds the VSM representation of all the words in your training corpus.

Alain

Alain Désilets

Mar 8, 2018, 1:38:53 PM

to DKPro Similarity Users

Hmm... spoke too soon. When I tried to modify the ESA example to work from my Mallet embeddings file, I discovered that the path you pass to the VectorIndexReader is actually a directory containing .jdb files for a DB that indexes the word embedding vectors. But what I am able to generate is just a single ASCII file containing the vectors, one vector per line.

After much snooping around, I think I found a bunch of classes that I can put together to write a small app for converting the ASCII vector file to a BerkeleyDB dump. But I get the feeling that this app must already exist and I just haven't found it.

Basically, what I have in mind is to write an app that:

* Creates a VectorIndexWriter

* Reads each line of the ASCII file

* Creates a SparseVector object from that line and uses the VectorIndexWriter's put() method to add the vector to the index

Does that sound about right? And also, does something like that already exist? I don't want to go re-inventing the wheel.

Torsten Zesch

Mar 8, 2018, 1:57:41 PM

to Alain Désilets, DKPro Similarity Users

Dear Alain,

What you describe sounds like the right way to go.

AFAIK this does not exist for plain text formats.

I think this has mostly been used for ESA indexes, and there is exactly one converter, ConvertLuceneToVectorIndex, in the corresponding package. A PlainTextToVectorIndex does not exist yet.

-Torsten
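
For anyone landing on this thread later, the three steps Alain outlines could be sketched roughly as below. This is only a sketch under stated assumptions, not an existing DKPro Similarity tool: it assumes each line of the ASCII file has the form "term v1 v2 ... vn" separated by whitespace, and it uses a hypothetical VectorSink interface and plain double[] arrays as stand-ins for VectorIndexWriter.put() and SparseVector, since the exact signatures of those classes are not given in the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for the index writer's put() step.
 * This is NOT a DKPro Similarity type; a real converter would
 * delegate to VectorIndexWriter here instead.
 */
interface VectorSink {
    void put(String term, double[] vector) throws IOException;
}

class AsciiVectorsToIndex {

    /**
     * Parses one line of the assumed form "term v1 v2 ... vn".
     * Returns null for blank lines so the caller can skip them.
     */
    static Map.Entry<String, double[]> parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] fields = trimmed.split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("No vector components on line: " + line);
        }
        double[] vector = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            vector[i - 1] = Double.parseDouble(fields[i]);
        }
        return new AbstractMap.SimpleEntry<>(fields[0], vector);
    }

    /** Reads every line, hands each parsed vector to the sink, returns the count. */
    static int convert(BufferedReader reader, VectorSink sink) throws IOException {
        int count = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            Map.Entry<String, double[]> entry = parseLine(line);
            if (entry == null) {
                continue; // skip blank lines
            }
            sink.put(entry.getKey(), entry.getValue());
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: AsciiVectorsToIndex <ascii-vector-file>");
            return;
        }
        // An in-memory map plays the role of the index in this sketch.
        Map<String, double[]> inMemoryIndex = new LinkedHashMap<>();
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            int n = convert(reader, inMemoryIndex::put);
            System.out.println("Converted " + n + " vectors");
        }
    }
}
```

To target the real index, the lambda passed to convert() would wrap each double[] in whatever vector type VectorIndexWriter expects and call its put() method. Keeping the parsing separate from the sink makes it possible to test the file handling without a BerkeleyDB directory on disk.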
