Doc2Vec and summarizer


ahe61

Sep 3, 2019, 04:21:13
to Gensim
Hi,

I've a set of reviews in a niche domain (medical reports). The typical length of a review is approx. 150 words (one paragraph). The total number of different root words (lemmas) in the documents is approx. 1000. Each .

My workflow is as follows (a rough code sketch follows the list):

1. tokenize - identify terminology (and fix as one word) - spell correction - lemmatization (all spaCy)
2. w2v (Gensim), and check if there are words that look like a term (headcahe for headache) but are clearly typos. Correct those
3. repeat 1, 2 until you are happy
4. Do the production w2v (Gensim) and do whatever you wish to do with the results
5. Do the production doc2vec (Gensim) and do whatever you wish to do with the results
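
Roughly, steps 1, 4 and 5 in code (a simplified sketch: the terminology fixing and spell correction of step 1 are left out, `en_core_web_sm` is just an assumed spaCy model, and parameter names follow current gensim, where older releases use `size` instead of `vector_size`):

import spacy
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

nlp = spacy.load("en_core_web_sm")   # assumes this spaCy model is installed

def preprocess(text):
    # step 1 (simplified): tokenize + lemmatize; terminology fixing and spell correction omitted
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]

reviews = ["..."]                    # placeholder: the ~3000 raw review strings go here
corpus = [preprocess(r) for r in reviews]

# step 4: production Word2Vec, small dimensionality for a ~1000-word vocabulary
w2v = Word2Vec(corpus, vector_size=32, window=5, min_count=2, epochs=40)

# step 5: production Doc2Vec over the same cleaned tokens
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]
d2v = Doc2Vec(tagged, vector_size=32, min_count=2, epochs=40)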

I'm most interested in summarization; reviews of the same product have a lot of similarities. However, I cannot find a way to use Gensim's text summarization on the output of steps 1-5. It takes raw text, does a bit of its own tokenization and gives a keyword/text summary using TextRank, so it basically throws out all the cleaning that has been done so far.
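
For reference, the interface in question looks like this (the `gensim.summarization` module of gensim 3.x; both functions accept only a raw string):

from gensim.summarization import summarize, keywords

raw_text = open("review.txt").read()      # raw, uncleaned text (placeholder file name)
print(summarize(raw_text, ratio=0.2))     # TextRank sentence extract, using gensim's own tokenization
print(keywords(raw_text, words=10))       # keyword extraction, same internal preprocessing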

Before you start beating me up about the numbers: 3000 reviews, with a 1/3-2/3 positive/negative label split. The spaCy documentation indicates that 3000 should be more than enough once you've gone through the full clean-up.

thanks, Andreas
Message has been deleted

Gordon Mohr

Sep 3, 2019, 14:27:26
to Gensim
I'm not familiar with the spaCy docs or practices about what's "more than enough", but regarding those corpus-size numbers:

(3000 docs * 150 words/doc =) 450,000 words is fairly small for Word2Vec training. Also, only 1000 unique words is very small; I wouldn't expect to get strong 300d or even 100d word-vectors for such a tiny vocabulary. (Maybe, 20-32d vectors?)
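
A back-of-envelope way to see the dimensionality point: with ~1000 unique words, a 300d model already has more word-vector weights than training words, while ~32d leaves several training words per weight:

vocab_size = 1000          # unique lemmas in the reviews
train_words = 3000 * 150   # total training words: 450,000

for dims in (300, 100, 32):
    # rough weight count of the input + output word-vector layers: 2 * vocab * dims
    params = 2 * vocab_size * dims
    print(dims, params, round(train_words / params, 1))   # training words per weight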

Also, 3000 docs is very small for `Doc2Vec`: published work often uses tens-of-thousands to millions of training documents. 

You are correct that gensim's summarization functions offer no way in their interface to plug in your own tokenization or comparison options. `summarize()` takes just a raw string, and you're stuck with its fixed, internal sentence- and word-tokenization and its other processing – which doesn't appear to make use of things like word-vectors anyway, as opposed to simple exact word co-occurrences.

So if your cleaning process can output a plain-string improved version - especially with regard to typos – that is still readable text, it might help the summarizer a little. But vector-modeling can't help at all.
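
For example, something along these lines, where `clean_text()` and `raw_review_text` are hypothetical stand-ins for your own cleaning step and input; the cleaning should return readable prose with sentence punctuation intact:

from gensim.summarization import summarize

cleaned = clean_text(raw_review_text)     # hypothetical: typos/terminology fixed, no lemmatization
print(summarize(cleaned, word_count=50))  # TextRank extract over the cleaned plain string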

And, while lemmatization might help the algorithm notice sentence-interrelationships a little, it'd also result in "summaries" that include lemmatized words, which may not be what's wanted. 

So I'm not sure I'd expect much from "gensim.summarization". It's pretty simple and inflexible code, without strong links to other gensim algorithms & practices – and even the tutorial examples are unimpressive. 

- Gordon 

andreas heiner

Sep 4, 2019, 00:29:13
to gen...@googlegroups.com
Hi Gordon,

thanks for the feedback. 

Standard Gensim does tokenization and, depending on the author, removes numbers, a set of stopwords, non-ASCII symbols etc., with the philosophy that these words don't contribute to the story. For understanding the story it's often not relevant whether it's "he" or "she", plural or singular, past or present tense; so lemmatization, taken to the extreme, makes sense.

With a lemmatized Word2Vec experiment I found most typos (encouraging). Doc2Vec also gives nicely matching documents (at both the lemmatized and the original level). Hence my idea that summarization could/should also work, especially since in normal English conversation about 1000 lemmas account for roughly 75% of the corpus (https://web.archive.org/web/20111226085859/http://oxforddictionaries.com/words/the-oec-facts-about-the-language). Your comment that you can't have 100d vectors here is also true: I got some seriously weird results when doing a sentiment analysis with just pos/neg labels. I will give it a new try with much shorter vectors.
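
One way to do that kind of typo hunting is to look for nearest neighbours of a term that are also close in spelling; a sketch (`w2v` stands for the trained Word2Vec model, difflib is from the standard library):

import difflib

def likely_typos(w2v, term, topn=20, cutoff=0.8):
    # candidates with similar vectors; keep those that are also close in edit distance
    neighbours = [w for w, _ in w2v.wv.most_similar(term, topn=topn)]
    return difflib.get_close_matches(term, neighbours, n=5, cutoff=cutoff)

print(likely_typos(w2v, "headache"))   # might surface variants like "headcahe"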

Thanks, I'll get back to the forum once I've done some more experiments

best,

andreas





Sabkat Khattak

Sep 9, 2019, 18:02:19
to Gensim
Respected Sir,

I have found your many replies to different questions related to NLP, in particular to the two models Doc2Vec and Word2Vec, and it is clear you have a deep understanding of them. I am working with the Doc2Vec model, but despite doing whatever is possible with it, I cannot achieve my desired results. I would like to speak to you about it; if you can please find some time for me, I shall be really grateful. Thank you