Steps to improve results for Gensim TextRank


vishnu...@gmail.com

Dec 19, 2017, 11:48:08 AM
to gensim
Below is the code I used to preprocess the text and apply TextRank (I followed the gensim TextRank tutorial). Please help me with a method to get better results. My text data is a column from a CSV with more than 2000 rows (each row is a sentence). The output I get is 18 lines of text as the summary (each a separate line, not a paragraph) and 20 words as keywords.

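# Assumes `df` is a pandas DataFrame already loaded from the CSV.
from nltk.stem import WordNetLemmatizer

reg_ex = r'[^a-zA-Z]'  # anything that is not an ASCII letter
replace = ' '
wordnet_lemmatizer = WordNetLemmatizer()
#stop = stopwords.words('english')

# Replace non-letters with spaces, lemmatize each token, then lowercase.
comp_df = (
    df['COMMENT']
    .str.replace(reg_ex, replace, regex=True)
    .apply(lambda t: ' '.join(wordnet_lemmatizer.lemmatize(w) for w in t.split()))
    .str.lower()
)
aa = comp_df.to_string()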

import requests

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim.summarization import summarize
from gensim.summarization import keywords


print ('Summary:')
print (summarize(aa,ratio=0.01))

print ('\nKeywords:')
print (keywords(aa, ratio=0.01))


Ivan Menshikh

Dec 20, 2017, 12:58:39 PM
to gensim
Hi,
As far as I remember, if your text is in English there is no need to preprocess it; try passing the raw text directly.
What is the problem with your results right now? Can you describe it in detail?
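
One way to read the advice above is to build the input from the raw comments, one sentence per row, rather than from `Series.to_string()`, which also writes the row index into the text and can shorten long cells depending on pandas display settings. The sketch below is only an illustration: `rows_to_text`, `summarize_rows`, and the sample `comments` list are hypothetical names, not from the thread, and the gensim call assumes a 3.x install, since `gensim.summarization` was removed in gensim 4.0.

```python
import re


def rows_to_text(rows):
    """Join raw comment rows into one text, one sentence per row."""
    sentences = []
    for row in rows:
        # Collapse internal whitespace and drop empty rows.
        cleaned = re.sub(r"\s+", " ", str(row)).strip()
        if not cleaned:
            continue
        # End each row with sentence punctuation so a sentence
        # splitter can tell the rows apart.
        if cleaned[-1] not in ".!?":
            cleaned += "."
        sentences.append(cleaned)
    return " ".join(sentences)


def summarize_rows(rows, ratio=0.01):
    """Summarize the joined rows (assumes gensim 3.x is installed)."""
    # Imported lazily so rows_to_text works without gensim.
    from gensim.summarization import summarize
    return summarize(rows_to_text(rows), ratio=ratio)


# Hypothetical sample rows standing in for df['COMMENT'].tolist()
comments = [
    "No child goes to school",
    "There is a huge gap in the understanding of the process.",
    "   ",
]
text = rows_to_text(comments)
print(text)
```

With the real data, `summarize_rows(df['COMMENT'].dropna().tolist())` would be the equivalent call; whether the result is more useful still depends on the comments themselves.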

vishnu...@gmail.com

Dec 20, 2017, 1:23:05 PM
to gensim
I expected a reasonable summary of the 2000 rows. I got the output below: the summary is a set of separate sentences, and the keywords look random. Most of the key phrases were not domain related. For example, my data is customer queries, but it just listed random key phrases without capturing domain-specific ones.

Summary:

70                          no child goes to school..
229     there is a huge gap in the understanding ...
282     process of making this a huge success is to ...


Keywords:
under
understand
succeed
text 

Ivan Menshikh

Dec 21, 2017, 4:36:50 AM
to gensim
A small clarification: summarization treats all of your input as one big text and summarizes that (this is why you get 18 lines, not 2000).
You have several options:
- Try different parameters (ratio and word_count).
- Try the new algorithm for keyword extraction: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/summarization/mz_entropy.py#L13 (in the `develop` branch).
- Try fitting an LDA model (I don't know your final goal, but you probably want to understand the topics of the customer queries).
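
The first option, trying different `ratio` and `word_count` values, can be sketched as a small sweep. This is a minimal illustration, not the thread author's code: `sweep_summaries` and the stand-in `first_words` summarizer are hypothetical, and the sweep only assumes that the summarizer accepts `ratio` and `word_count` keyword arguments, as `gensim.summarization.summarize` does in gensim 3.x.

```python
def sweep_summaries(text, summarize_fn, ratios=(0.01, 0.05, 0.1), word_counts=(50, 100)):
    """Run summarize_fn under several settings and collect the results.

    summarize_fn must accept `ratio` and `word_count` keyword arguments.
    """
    results = {}
    for ratio in ratios:
        results[("ratio", ratio)] = summarize_fn(text, ratio=ratio)
    for count in word_counts:
        results[("word_count", count)] = summarize_fn(text, word_count=count)
    return results


def first_words(text, ratio=0.2, word_count=None):
    """Stand-in summarizer for the demo only; this is NOT TextRank."""
    words = text.split()
    limit = word_count if word_count is not None else max(1, int(len(words) * ratio))
    return " ".join(words[:limit])


demo = sweep_summaries(
    "one two three four five six seven eight nine ten",
    first_words,
    ratios=(0.2, 0.5),
    word_counts=(3,),
)
for setting, summary in demo.items():
    print(setting, "->", summary)
```

With gensim 3.x installed, passing `summarize` from `gensim.summarization` in place of `first_words` would run the same sweep over the real summarizer, which makes it easier to compare outputs side by side.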

vishnu...@gmail.com

Dec 21, 2017, 3:47:26 PM
to gensim

Could not import 'mz_keywords'

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-36-965838370934> in <module>()
      2 from gensim.summarization import summarize
      3 from gensim.summarization import keywords
----> 4 from gensim.summarization import mz_keywords
      5 

ImportError: cannot import name 'mz_keywords'

Ivan Menshikh

Dec 21, 2017, 11:25:39 PM
to gensim
Update your gensim version with pip install --upgrade gensim; this is available in 3.2.0 (the latest release).
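
A quick way to check whether the installed version should have `mz_keywords` is to compare the version string before importing. This is a rough sketch with hypothetical helper names (`parse_version`, `has_mz_keywords`); the lower bound of 3.2.0 comes from the reply above, while the upper bound is an added assumption based on `gensim.summarization` having been removed in gensim 4.0.

```python
from importlib import metadata


def parse_version(version):
    """Turn '3.2.0' into (3, 2, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_mz_keywords(version):
    """True if this gensim version is expected to ship mz_keywords."""
    # Assumption: added in 3.2.0, gone with the summarization module in 4.0.
    return (3, 2, 0) <= parse_version(version) < (4, 0, 0)


try:
    installed = metadata.version("gensim")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("gensim is not installed")
elif has_mz_keywords(installed):
    print("gensim", installed, "should provide gensim.summarization.mz_entropy.mz_keywords")
else:
    print("gensim", installed, "is not expected to provide mz_keywords")
```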