Exclude stopwords when comparing similarity of documents

Alain Désilets

Mar 19, 2018, 2:11:26 PM

to DKPro Similarity Users

I need to compare the similarity of a bunch of job offers.

When I compute a Mallet word embedding similarity between pairs of job offers, they always come out with a very high similarity, above 0.95.

I suspect the reason for this is that the language used in job offers is very standard and therefore all documents seem very similar on the surface.

I was wondering whether I could feed the similarity evaluator a list of stopwords that includes not only generic English stopwords, but also words that are commonly used in most job offers.

Thx.

Alain Désilets

Torsten Zesch

Mar 19, 2018, 4:12:25 PM

to Alain Désilets, DKPro Similarity Users

There is no such functionality in DKPro Similarity itself.

The idea is to use e.g. StopWordRemover from DKPro Core

https://dkpro.github.io/dkpro-core/releases/1.9.0/docs/component-reference.html#engine-StopWordRemover

before feeding the documents into the similarity measure.

If you are not using Core, I would still recommend filtering the token list before feeding it to the similarity measure.
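For the case without DKPro Core, here is a minimal sketch of that filtering step in plain Java. The class name StopwordFilter, the method removeStopwords, and the stopword entries are all hypothetical, chosen only for illustration; nothing here is part of DKPro Similarity or DKPro Core, and it assumes the documents are already tokenized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordFilter {

    // Hypothetical list: a few generic English stopwords plus words
    // that tend to appear in most job offers. Replace with your own.
    static final Set<String> JOB_OFFER_STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "a", "and", "of", "we", "are", "for", "with",
            "experience", "team", "candidate", "skills", "position"));

    /** Returns the tokens that are not in the stopword set (case-insensitive match). */
    static List<String> removeStopwords(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Stopwords are stored in lower case, so normalize before the lookup.
            if (!stopwords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

The filtered list for each job offer would then be passed to the similarity measure in place of the raw token list.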

-Torsten

