Document similarity with Wikipedia or Google news


Jay Qadan

Nov 30, 2018, 9:53:34 PM
to Gensim
I am trying to use this Doc2vec-wikipedia example, but to compute similarity against an arbitrary document, like the news article in the attached sample. Due to computational challenges, I used 'text8' instead of the full Wikipedia dump, via gensim's api.load("text8"):

  • Is this the best approach (Doc2Vec) to find document similarity with a large corpus? Any suggestion if there is a better method to get similarities based on topic rather than similar words?
  • As suggested, I used this code to look up similarity, though with a larger number of words than just 'machine', 'learning':
    • print(model.docvecs.most_similar(positive=[model.infer_vector(['machine','learning'])], topn=20))
    • However, the result I get is in this format: (502, 0.5730128288269043), (94, 0.5560649633407593), (187, 0.5478538870811462), not article titles as in the original example. Any suggestions on how to get the article titles in the 'text8' corpus?
UK Poisoning.txt

Gordon Mohr

Dec 1, 2018, 1:20:46 AM
to Gensim
'text8' is just bulk text from part of Wikipedia, concatenated together for compression tests. It has lost the article boundaries and titles, and further may only be from "early" articles in some sorted collection of articles. These factors make it almost entirely useless for meaningful doc-vector training. (Gensim's handling will break it into manageably-sized lines, creating pseudo-documents, and since these lines will often have long runs of sentences from individual articles, the doc-vectors may have some slight topical power. But the doc-ids will still just be line numbers.)

You'd have to work with a better dump, where the documents are per-article and the tags are article titles, to get more meaningful results back. (If the full dataset is too large, discarding short articles, and truncating larger articles to a few hundred or thousand words, might help make training memory/times more manageable.)
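
For example, a minimal sketch of that shape – here `article_stream` is a hypothetical iterable of (title, list-of-tokens) pairs from whatever dump-parsing you use, and the thresholds are arbitrary:

    from gensim.models.doc2vec import TaggedDocument

    def tagged_articles(article_stream, min_tokens=50, max_tokens=1000):
        for title, tokens in article_stream:
            if len(tokens) < min_tokens:
                continue  # discard too-short articles
            yield TaggedDocument(words=tokens[:max_tokens], tags=[title])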

- Gordon

Jay Qadan

Dec 1, 2018, 8:31:06 AM
to Gensim
Suppose I want to train on only the titles of the Wikipedia dump, so the matching will be with the article titles rather than the whole content. How can I do that in gensim?

Gordon Mohr

Dec 1, 2018, 12:28:27 PM
to Gensim
You could find a list of all article titles, and use each title as both the tokenized `words` of a document, and that document's single string `tag`. 
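
For example, a minimal sketch, assuming you already have a plain list of title strings (the `titles` name is hypothetical):

    from gensim.models.doc2vec import TaggedDocument

    # `titles` is assumed: a list of article-title strings obtained elsewhere
    title_docs = [TaggedDocument(words=title.lower().split(), tags=[title])
                  for title in titles]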

But I wouldn't expect very good results from such an approach. Doc2Vec works better with documents that are at least a few dozen words – and documents that are just article titles would often be just 1-4 words. 

Using the first few dozen to hundreds of words from each article would likely work better, or some sort of abstract/summary of the articles. 

I don't know the current format/quality of the abstracts available at <https://dumps.wikimedia.org/enwiki/latest/>, but those might work. 

Alternatively, there's a Wikipedia API call that gets back a "summary" of an article (typically the first paragraph before any other named sections): <https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_summary_title>. You'd likely want to be careful if making bulk requests against this: make requests at a measured pace, handle transient errors, and save the results for reuse to avoid redundant requests.
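
As a rough sketch of that kind of careful bulk fetching (the pause length, retry count, cache location, and User-Agent string are all arbitrary choices, not requirements of the API):

    import json
    import time
    from pathlib import Path
    from urllib.parse import quote

    import requests  # third-party HTTP library

    CACHE = Path("summaries")  # arbitrary cache directory
    CACHE.mkdir(exist_ok=True)

    def fetch_summary(title, pause=1.0, retries=3):
        """Fetch one article summary, reusing a cached copy if present."""
        cached = CACHE / (title.replace("/", "_") + ".json")
        if cached.exists():
            return json.loads(cached.read_text(encoding="utf-8"))
        url = ("https://en.wikipedia.org/api/rest_v1/page/summary/"
               + quote(title.replace(" ", "_"), safe=""))
        for attempt in range(retries):
            resp = requests.get(url, headers={"User-Agent": "doc2vec-experiment"})
            if resp.status_code == 200:
                cached.write_text(resp.text, encoding="utf-8")  # save for reuse
                time.sleep(pause)  # measured pace between requests
                return resp.json()
            time.sleep(pause * (attempt + 1))  # back off on transient errors
        return None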

- Gordon

Jay Qadan

Dec 2, 2018, 5:44:51 AM
to Gensim
Thanks Gordon. On your suggestion of truncating larger articles, how do I achieve that? Suppose I want to download "wiki-english-20171001" with gensim's api.load(); how would I truncate it?

Gordon Mohr

Dec 2, 2018, 2:36:41 PM
to Gensim
You could follow the example of the doc2vec-wikipedia notebook (https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb) to the point of getting the Wikipedia data, but then write the items you get back from `get_texts()` to an interim file of title & tokens – discarding tokens in excess of some threshold before writing. (This one-time process could also discard too-small articles.)
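
For instance, a rough sketch of that one-time pass, assuming a tab-separated "title<TAB>space-joined-tokens" interim format and arbitrary size thresholds (in some older gensim versions the tokens come back as bytes and would need decoding first):

    from gensim.corpora.wikicorpus import WikiCorpus

    MIN_TOKENS, MAX_TOKENS = 50, 1000  # arbitrary thresholds

    wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2")
    wiki.metadata = True  # get_texts() then also yields (page_id, title)

    with open("wiki_truncated.txt", "w", encoding="utf-8") as out:
        for tokens, (page_id, title) in wiki.get_texts():
            if len(tokens) < MIN_TOKENS:
                continue  # discard too-small articles
            out.write(title.replace("\t", " ") + "\t"
                      + " ".join(tokens[:MAX_TOKENS]) + "\n")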

Then, read that file back into a new corpus-iterator to do your training. On the downside, you'd still have to download and scan the full dump once. On the upside, the truncated file may be much faster to re-iterate over for multiple training passes, as it's now just the titles & plain text rather than the original XML dump.
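
A sketch of such a corpus-iterator over that interim file, with the training parameters purely illustrative:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    class TruncatedWikiCorpus:
        """Re-iterable stream of TaggedDocuments from the interim file."""
        def __init__(self, path):
            self.path = path
        def __iter__(self):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    title, text = line.rstrip("\n").split("\t", 1)
                    yield TaggedDocument(words=text.split(), tags=[title])

    corpus = TruncatedWikiCorpus("wiki_truncated.txt")
    model = Doc2Vec(corpus, vector_size=200, epochs=10, workers=4)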

Alternatively, look into the abstracts-download or per-article summary downloading I'd mentioned in the previous message. 

I'd not recommend the use of `api.load()` for anything you could reasonably do yourself - it hides steps/details in unhelpful ways. 

- Gordon

Benedict Holland

Dec 3, 2018, 11:34:09 AM
to gen...@googlegroups.com
Use cosine similarity. Doc2vec gives word embeddings, not document similarity. There are a variety of extensions using cosine similarity, like incorporating distances to important words.

Thanks,
~Ben


Gordon Mohr

Dec 3, 2018, 6:18:57 PM
to Gensim
All the existing similarity methods on Word2Vec/Doc2Vec/KeyedVectors already use cosine similarity. 

Doc2Vec will always train vectors for the document tags provided, but only train word-vectors in some modes. So Doc2Vec gives doc-embeddings that can be used for document-similarity for sure, but only sometimes gives useful word-embeddings. 
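
For example, a quick sketch (assuming a Doc2Vec `model` trained with article-title tags; the two titles here are hypothetical):

    import numpy as np

    # most_similar() already ranks candidate doc-vectors by cosine similarity
    vec = model.infer_vector(['machine', 'learning'])
    print(model.docvecs.most_similar(positive=[vec], topn=5))

    # the same cosine value, computed directly between two tagged docs
    a, b = model.docvecs['Machine learning'], model.docvecs['Deep learning']
    print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))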

- Gordon

