NameError: name 'dictionary' is not defined

Tansu Taşçıoğlu

Feb 14, 2019, 11:58:50 AM

to Gensim

Hello, I am trying to apply LSA and LDA with Gensim to my own corpus. I followed the instructions in https://radimrehurek.com/gensim/tutorial.html under the title 'Corpus Streaming - One Document at a Time', but I get an error:

line 12, in __iter__
yield dictionary.doc2bow(line.lower().split())
NameError: name 'dictionary' is not defined

I will apply LSA and LDA to Turkish Wikipedia for my master's thesis. For other text files I used Python's ordinary file-reading methods, and because those files are small I did not get a memory error. However, the Wikipedia file is huge and the program is killed :S Does anybody know how I can solve this problem? Thank you.

Gordon Mohr

Feb 14, 2019, 1:03:39 PM

to Gensim

Note this tutorial assumes you will be stepping through all its code in order, so that the variables from earlier steps are still available. About 8 blocks/paragraphs up from <https://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time>, there's a line which assigns to `dictionary`:

dictionary = corpora.Dictionary(texts)

If you are adapting this code for other uses, you'll have to make sure similarly, appropriately-initialized variables are available. (You might do this, in the same style as the tutorial, by preparing a `dictionary` variable before defining your iterable class. Or you might further enhance that class with an `__init__()` method that takes an argument and does such preparation in the class, as in the `TxtSubdirsCorpus` example in this longer article about iterators & iterable objects: <https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/>.)
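As a minimal sketch of the second option (not the tutorial's or the blog post's exact code), the class below receives the dictionary through `__init__()` instead of relying on a global variable. The parameter names `path` and `dictionary` are assumptions chosen for illustration; the only thing the class needs from `dictionary` is a `doc2bow()` method, such as the one on the `corpora.Dictionary` object created earlier in the tutorial.

```python
class MyCorpus:
    """Stream one bag-of-words vector per line of a text file."""

    def __init__(self, path, dictionary):
        # The dictionary is passed in explicitly, so __iter__ never depends
        # on a global name that may not have been defined yet.
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        # The file is reopened on every pass and read line by line,
        # so only one document is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield self.dictionary.doc2bow(line.lower().split())
```

Under these assumptions, something like `corpus = MyCorpus("mycorpus.txt", dictionary)` (the file name is a placeholder) would be created only after `dictionary` exists, which is the ordering whose absence produces the NameError.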

- Gordon

Tansu Taşçıoğlu

Feb 15, 2019, 8:28:02 AM

to gen...@googlegroups.com

But I want to change just the documents part. I want to read from the file in a memory-friendly way, because the program is KILLED every time in the terminal :S

dictionary = corpora.Dictionary(texts) is correct, but:

texts = [[word for word in document.lower().split() if word not in stoplist]
          for document in documents]

and 
stoplist = set('for a of the and to in'.split())

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

--
You received this message because you are subscribed to the Google Groups "Gensim" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gensim+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gordon Mohr

Feb 15, 2019, 1:39:19 PM

to Gensim

I don't know what you're asking, or what this new `documents` code, and new "KILLED" problem, has to do with your previous question about `dictionary` not being defined.

But, the blog post I previously linked explains both reasons for and methods of iterating things from a file or files.
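As a rough illustration of that file-iteration idea (a sketch, not code taken from the blog post), a generator can yield the tokens of one line at a time so that neither a `documents` list nor a `texts` list is ever built in memory. The function name `iter_tokens` and its parameters are hypothetical.

```python
def iter_tokens(path, stoplist=frozenset()):
    """Yield the lowercased tokens of each line, skipping stop words."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # One line (one document) is tokenized and handed out at a time.
            yield [word for word in line.lower().split() if word not in stoplist]
```

Since `corpora.Dictionary` accepts any iterable of token lists, passing it `iter_tokens("mycorpus.txt", stoplist)` (file name assumed) should build the dictionary in a single streaming pass over the file.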

- Gordon

Tansu Taşçıoğlu

Feb 18, 2019, 2:42:12 AM

to gen...@googlegroups.com

I guess I should give more detail about this situation. I have three datasets whose sizes are 60 MB, 108 MB and 420 MB (the last is Wikipedia). For all of them I read from the file like this:

file = open("mycorpus.txt", "r")

documents = file.readlines()  # mycorpus.txt holds one document per line, like Radim's documents[] in the tutorial

file.close()

I can get LSA results for the 60 MB and 108 MB datasets, but not for Wikipedia (420 MB). When I run the code on Wikipedia it runs for approx. 2.5 hours and then I get the KILLED message in the terminal. Therefore I think that, because of the size, I should use a more memory-friendly approach. I checked the tutorials, and there is a memory-friendly way to read the file, but with it I get the name 'dictionary' is not defined error. After your reply I checked the tutorial again and again, but I am very confused about where I should create the MyCorpus object. If I follow the tutorial step by step, I again get the name 'dictionary' is not defined error. Thank you.
