Stoplists and filtering dictionaries

James Bridle

Dec 2, 2014, 4:05:58 PM
to gen...@googlegroups.com
Hello - 

Thanks very much for gensim - it's a fantastic package. I got up and running very quickly. I'm trying to do LSA on a corpus of some 1.25 short texts, and I'm struggling a bit with the order in which I should do things to get the most useful results.

I have a text file (actually a one-column CSV output) of all the texts, one line per document.

I can create a corpus and mm file from these, and then do the LSA stuff described in the tutorials. (Resulting in .dict, .mm, and LSI .plk files.) 

However, I haven't removed common words using stoplists and filtering extremes, and I want to go back and do this properly. 

Before LSA, I took the basic steps:

from gensim.corpora import TextCorpus, MmCorpus, Dictionary
background_corpus = TextCorpus(input="texts.csv.bz2")
background_corpus.dictionary.save("my_dict.dict")
MmCorpus.serialize("background_corpus.mm", background_corpus)

Should I remove words from the dictionary file alone before or after creating the corpus? 
Is the mm corpus created using this dictionary, or are the two separate processes?
Can I remove words from the dictionary file after creating a corpus, or do I need to rebuild it?

In short, I need to understand the ongoing relationship between dictionary and corpus a bit better (both are used to create the LSI, as I understand it), and any advice would be greatly appreciated.

Thanks!

James

Radim Řehůřek

Dec 3, 2014, 5:42:35 AM
to gen...@googlegroups.com
Hello James,

thank you for the kind words.

Dictionary (mapping words to ids) and vector corpora are indeed related.

There used to be an FAQ page on gensim's GitHub wiki, but I see the page is gone now. I don't know why; maybe it was vandalized.

There's an "online filtering" example there (at the bottom).

The simpler option is to load back your dictionary, run filter_extremes/whatever else to change it, and serialize your corpus again (using this new dictionary).

More generally, when you change your dictionary (= change word ids, number of words...), you have to update or re-create the corpus as well, to reflect the new ids. Changing the dictionary doesn't automatically update the corpora that were generated from it.

HTH,
Radim

Christopher S. Corley

Dec 3, 2014, 5:46:56 AM
to gensim
Looks like the FAQ on the wiki was deleted by user "zhbzz2007" 2 days ago.
I've reverted the commit on the wiki, so the page should be back online now at
https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ

cheers,
Chris.

Excerpts from Radim Řehůřek's message of 2014-12-03 04:42:35 -0600:

James Bridle

Dec 3, 2014, 5:49:26 AM
to gen...@googlegroups.com
Great, thanks. That example makes it much clearer!

Radim Řehůřek

Dec 3, 2014, 6:12:49 AM
to gen...@googlegroups.com
To be clear: that FAQ example is for "online filtering" (option 1).

Creating the corpus from scratch using the new dictionary (option 2), ignoring the old serialized corpus, is conceptually much simpler.

Unless my original data source were no longer available, or took too long to re-process, I'd always go for option 2 for simplicity :-)

But both are valid of course.

Best,
Radim

Alex Ma

Oct 8, 2017, 11:25:39 AM
to gensim
Dear Radim,
Hi, it seems you are very experienced with gensim.
Can you help me with the following:

I have a dictionary. I am trying to filter tokens with filter_extremes,
but the result is "None"?

   print dict3.filter_extremes(no_above=4000)
   None

What is wrong?
Thank you.

On Wednesday, December 3, 2014 at 13:42:35 UTC+3, Radim Řehůřek wrote:

Ivan Menshikh

Oct 9, 2017, 3:21:03 AM
to gensim
Hi Alex,

This method always returns None, because it modifies the dictionary in place. To make sure that everything works fine, you can check the length of the dictionary before and after filtering (or enable logging messages).

Short example:
from gensim.corpora import Dictionary
import logging


logging.basicConfig(level=logging.INFO)

corpus = [
   ["a", "a", "a", "b", "b"],
   ["b", "b", "b", "c"]
]

dct = Dictionary(corpus)
assert len(dct) == 3

dct.filter_extremes(no_below=2, no_above=1.)
assert len(dct) == 1


Output

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(3 unique tokens: [u'a', u'c', u'b']) from 2 documents (total 9 corpus positions)
INFO:gensim.corpora.dictionary:discarding 2 tokens: [(u'a', 1), (u'c', 1)]...
INFO:gensim.corpora.dictionary:keeping 1 tokens which were in no less than 2 and no more than 2 (=100.0%) documents
INFO:gensim.corpora.dictionary:resulting dictionary: Dictionary(1 unique tokens: [u'b'])