Issues with DTM (dynamic topic modeling)


Jurica Seva

Sep 11, 2015, 12:24:15 PM
to gensim
Hi all, 

I am trying to get the gensim implementation of DTM to work, but I can't find the cause of the following issue:

2015-09-11 16:53:00,751 : INFO : training DTM with args --ntopics=100 --model=dtm  --mode=fit --initialize_lda=false --corpus_prefix=/Users/jurica/Documents/workspace/eclipse/TopicModeling/dtmPrefix/train --outname=/Users/jurica/Documents/workspace/eclipse/TopicModeling/dtmPrefix/train_out --alpha=0.01 --lda_max_em_iter=10 --lda_sequence_min_iter=6  --lda_sequence_max_iter=20 --top_chain_var=0.005 --rng_seed=0 

Traceback (most recent call last):
  File "/Users/jurica/Documents/workspace/eclipse/TopicModeling/topicModelingExample.py", line 158, in <module>
    dtm = DtmModel('binaryFiles/dtm-darwin64', corpus=mm, id2word=id2word, time_slices=time_slices, prefix='/Users/jurica/Documents/workspace/eclipse/TopicModeling/dtmPrefix/')
  File "/Users/jurica/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/gensim/models/wrappers/dtmmodel.py", line 121, in __init__
    self.train(corpus, time_slices, mode, model)
  File "/Users/jurica/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/gensim/models/wrappers/dtmmodel.py", line 197, in train
    self.em_steps = np.loadtxt(self.fem_steps())
  File "/Users/jurica/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/numpy/lib/npyio.py", line 738, in loadtxt
    fh = iter(open(fname, 'U'))
IOError: [Errno 2] No such file or directory: '/Users/jurica/Documents/workspace/eclipse/TopicModeling/dtmPrefix/train_out/em_log.dat'


So, basically, it says the file does not exist, which is true. I don't get why it doesn't exist. The dtmmodel.py file, which implements the DTM wrapper, doesn't seem to contain any code that would create that file and store it in the location defined via the prefix argument of the DtmModel class (by the way, prefix works with absolute paths and not relative ones; figuring that out took two hours of my time).


So the line that causes the issue is 197 in dtmmodel.py:

        self.em_steps = np.loadtxt(self.fem_steps())


and it tries to open the file whose path is returned by self.fem_steps() (in my case /Users/jurica/Documents/workspace/eclipse/TopicModeling/dtmPrefix/train_out/em_log.dat). The only other place self.fem_steps() is mentioned is where the path is defined (line 141 in dtmmodel.py), but it is not used in any of the file-saving actions implemented in dtmmodel.py.


The code I am using is very simple: 


import os
import gensim
from gensim.models.wrappers import DtmModel

id2word = gensim.corpora.Dictionary.load_from_text('topcModeling/wikipages_tfidf.mm_wordids.txt')
mm = gensim.corpora.MmCorpus('topcModeling/wikipages_tfidf.mm_tfidf.mm')
time_slices = [20000,20000,20000,20000,20000,20000,20000,20000,20000,6557]

prefix = os.path.dirname(os.path.abspath(__file__)) + '/dtmPrefix/'
#print prefix
if not os.path.exists('topcModeling/DTM_WikiDump.model'):
    dtm = DtmModel('binaryFiles/dtm-darwin64', corpus=mm, id2word=id2word, time_slices=time_slices, prefix=prefix)
    dtm.save('topcModeling/DTM_WikiDump.model')
else:
    dtm = DtmModel.load('topcModeling/DTM_WikiDump.model')

# doc_bow is a bag-of-words document defined elsewhere in the script (not shown)
doc_dtm = dtm[doc_bow]
doc_dtm.print_topics(20)


So, basically, it is trying to open a file that does not exist and is never created in the code. Any suggestions, ideas, other angles would be welcome. 


Best,

Jurica

Radim Řehůřek

Sep 12, 2015, 11:29:58 PM
to gensim, artyom....@live.com
Hello Jurica,

not sure what happened there (maybe a change in the upstream DTM code?), as I've never used DTM myself.

But let me CC Artyom, who is the author of the DTM wrapper in gensim.

Artyom should have a better idea of what this error could mean.

Best,
Radim

Artyom Topchyan

Sep 13, 2015, 7:19:32 AM
to gensim
This file gets generated by the actual DTM binary that is being run. It seems the DTM binary is not getting invoked properly, so the file is not generated. Can you check whether the binary is getting invoked correctly? This might be because the corpus is not being serialized properly into the temp folder; can you check whether prefix + 'train-mult.dat' gets created? Also, I have not tested this on a Mac.
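A quick way to run the check suggested above is sketched below. It only assumes the file names that appear in this thread (train-mult.dat, train-mult.dat.vocab, train-seq.dat and train_out/em_log.dat); the helper name report_dtm_files is made up for illustration and is not part of gensim.

```python
import os


def report_dtm_files(prefix):
    """Map each file expected under `prefix` to True/False (does it exist?).

    `prefix` is used exactly like in the wrapper's error messages: file names
    are appended to it directly, so it should end with a path separator.
    """
    expected = [
        prefix + 'train-mult.dat',        # corpus serialized by the wrapper
        prefix + 'train-mult.dat.vocab',  # vocabulary written next to it
        prefix + 'train-seq.dat',         # written by the wrapper as well
        prefix + 'train_out/em_log.dat',  # produced by the DTM binary itself
    ]
    return {path: os.path.isfile(path) for path in expected}
```

If the first three entries come back True and em_log.dat comes back False, the wrapper wrote its input files and the binary most likely never produced any output.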

@Radim I haven't touched the wrapper in a long while, but I might look into fixing some of the issues people have reported. Sorry for this. I will also look into creating an example notebook for this. If Jurica says where he got the dataset, I can perhaps try to recreate it.

Radim Řehůřek

Sep 13, 2015, 7:29:46 AM
to gensim
That would be super helpful. Thanks so much Artyom!

Radim

Jurica Seva

Sep 14, 2015, 4:50:24 AM
to gensim
Hi everyone, 

thanks for the responses. As far as the binary goes, I am using the dtm-darwin64 binary, downloaded from https://github.com/magsilva/dtm/tree/master/bin

The temp folder (dtmPrefix in my case) contains train-mult.dat (388.1 MB), train-mult.dat.vocab (588 KB) and train-seq.dat (51 KB). There is also a subfolder, train_out, which is empty and where the other files the library uses should be stored.

The dataset I am using at the moment, just to test the module and see how it fits into the overall project, is a wiki data dump taken from the Wikipedia dump site (the actual file being enwiki-latest-pages-articles20.xml-p011125004p013324998.bz2).

How can I make sure the binary is getting invoked properly? I used dtm-darwin64 because Darwin is the core set of components upon which OS X and iOS are based, and it is released by Apple.

J. 

Jurica Seva

Sep 23, 2015, 11:14:34 AM
to gensim
Hi everyone,

I just tried running it on a Win8 64-bit machine and I get the same error. Could it be that it's not the binary but something else? My guess would be that something gets scrambled in the paths.

Here's the code:

import gensim, os
from gensim.models.wrappers import dtmmodel
from gensim.corpora.dictionary import Dictionary
from gensim import corpora
prefix = os.path.dirname(os.path.abspath(__file__)) + '\\dtmPrefix\\'
dict = Dictionary.load('extractedData_Temp_NoSem_250T_10P.dict')
corpus = corpora.MmCorpus('extractedData_Temp_NoSem_250T_10P.mm')
my_timeslices = [20000,20000,20000,20000,20000,20000,2852]
print 'dictionary:', dict 
print 'corpus:', corpus
print 'prefix:', prefix

model = gensim.models.wrappers.DtmModel('D:/_work/_research/GensimDTM/dtm-master/bin/dtm-win64.exe', corpus, my_timeslices, num_topics=20, id2word=dict,  prefix=prefix)


Here's the error message:

dictionary: Dictionary(228068 unique tokens: [u'', u'personal effects', u'descriptive methods', u'mll1 complex', u'gag']...)
corpus: MmCorpus(122852 documents, 228068 features, 7892412 non-zero entries)
prefix: D:\_work\_research\GensimDTM\dtmPrefix\
Traceback (most recent call last):
  File "D:\_work\_research\GensimDTM\gensimDTM.py", line 13, in <module>
    model = gensim.models.wrappers.DtmModel('D:/_work/_research/GensimDTM/dtm-master/bin/dtm-win64.exe', corpus, my_timeslices, num_topics=20, id2word=dict,  prefix=prefix)
  File "C:\Users\Sheva\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim\models\wrappers\dtmmodel.py", line 121, in __init__
    self.train(corpus, time_slices, mode, model)
  File "C:\Users\Sheva\AppData\Local\Enthought\Canopy\User\lib\site-packages\gensim\models\wrappers\dtmmodel.py", line 196, in train
    self.em_steps = np.loadtxt(self.fem_steps())
  File "C:\Users\Sheva\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\lib\npyio.py", line 738, in loadtxt
    fh = iter(open(fname, 'U'))
IOError: [Errno 2] No such file or directory: 'D:\\_work\\_research\\GensimDTM\\dtmPrefix\\train_out/em_log.dat'

Maybe it's the prefix path I am passing?
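On the mixed separators in the path from the error message (backslashes followed by train_out/em_log.dat): the sketch below uses ntpath, the Windows flavor of os.path that can be imported on any platform, to show that such a path normalizes to an ordinary Windows path. This suggests, though it does not prove, that the mixed separators alone are not the cause.

```python
import ntpath

# Path copied from the IOError above: Windows separators plus one forward slash.
mixed = 'D:\\_work\\_research\\GensimDTM\\dtmPrefix\\train_out/em_log.dat'

# normpath converts forward slashes to backslashes under Windows path rules.
normalized = ntpath.normpath(mixed)

# Both separators are honored when splitting off the file name.
file_name = ntpath.basename(mixed)

print(normalized)
print(file_name)
```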

Any suggestions would be appreciated and helpful.

Best,
Jurica

Artyom Topchyan

Sep 24, 2015, 6:08:46 AM
to gensim
Hi,

So I finally created an example for DTM: https://github.com/Arttii/gensim_dtm_example. It only took me like a year. I apologize.

Concerning your error, there seems to be something weird going on with the DTM binary when we tell it not to initialize from LDA, which is what we do by default with initialize_lda=False. As far as I understood DTM, it is not necessary to initialize from LDA every time. For now, as a workaround, setting initialize_lda=True will most probably solve your issue.

Artyom

Radim Řehůřek

Sep 24, 2015, 9:02:54 AM
to gensim, Lev Konstantinovskiy
Great, thanks Artyom :)

Do you mind if we move this example notebook inside gensim (CC Lev), and link to it from the DTM docs?

Radim

Artyom Topchyan

Sep 24, 2015, 10:07:59 AM
to gensim, lev....@gmail.com
Sure, sounds good. Should I make a pull request, or will you handle it?

Radim Řehůřek

Sep 24, 2015, 12:45:17 PM
to gensim, lev....@gmail.com
Yes please!

The code will need some PEP8 code style love, but we can handle that in the pull request :)

-rr

Jurica Seva

Sep 25, 2015, 5:03:12 AM
to gensim
Artyom, Radim,

thank you for the update; you were right, adding the flag solved the issue and everything works fine now. 

Best,
J. 

Jurica Seva

Sep 29, 2015, 5:50:34 AM
to gensim
Hi guys, 

another update, as it seems I am still getting some errors. The workaround presented a few posts ago works on the win64 architecture, but when running the code in Unix or OS X environments I still get errors:
1) IOError: [Errno 2] No such file or directory: '/Users/jurica/GensimDTM/dtmPrefix/train-mult.dat'

which appears if I invoke model = gensim.models.wrappers.DtmModel(binaryFolder, corpus, my_timeslices, num_topics=20, id2word=dict, initialize_lda=True, prefix=prefix) with the prefix argument when the folder given in prefix does not exist (meaning it is not created automatically).

2) When I run the script with sudo I get
dyn086164:GensimDTM jurica$ sudo python gensimDTM.py
Password:
DEBUG:root:test
INFO:gensim.corpora.sharded_corpus:Could not import Theano, will use standard float for default ShardedCorpus dtype.
INFO:summa.preprocessing.cleaner:'pattern' package not found; tag filters are not available for English
INFO:gensim.utils:loading Dictionary object from /Users/jurica/GensimDTM/extractedData_Temp_NoSem_250T_10P.dict
INFO:gensim.corpora.indexedcorpus:loaded corpus index from /Users/jurica/GensimDTM/extractedData_Temp_NoSem_250T_10P.mm.index
INFO:gensim.matutils:initializing corpus reader from /Users/jurica/GensimDTM/extractedData_Temp_NoSem_250T_10P.mm
INFO:gensim.matutils:accepted corpus with 122852 documents, 228068 features, 7892412 non-zero entries
Dictionary(228068 unique tokens: [u'', u'personal effects', u'descriptive methods', u'mll1 complex', u'gag']...)
MmCorpus(122852 documents, 228068 features, 7892412 non-zero entries)
/Users/jurica/GensimDTM/dtmPrefix/
INFO:gensim.models.wrappers.dtmmodel:serializing temporary corpus to /tmp/6eb64f_train-mult.dat
INFO:gensim.corpora.bleicorpus:no word id mapping provided; initializing from corpus
INFO:gensim.corpora.bleicorpus:storing corpus in Blei's LDA-C format into /tmp/6eb64f_train-mult.dat
INFO:gensim.corpora.bleicorpus:saving vocabulary of 228068 words to /tmp/6eb64f_train-mult.dat.vocab
INFO:gensim.models.wrappers.dtmmodel:training DTM with args --ntopics=20 --model=dtm  --mode=fit --initialize_lda=true --corpus_prefix=/tmp/6eb64f_train --outname=/tmp/6eb64f_train_out --alpha=0.01 --lda_max_em_iter=10 --lda_sequence_min_iter=6  --lda_sequence_max_iter=20 --top_chain_var=0.005 --rng_seed=0
Traceback (most recent call last):
  File "gensimDTM.py", line 21, in <module>
    model = gensim.models.wrappers.DtmModel(binaryFolder, corpus, my_timeslices, num_topics=20, id2word=dict, initialize_lda=True)
  File "/Users/jurica/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/gensim/models/wrappers/dtmmodel.py", line 121, in __init__
    self.train(corpus, time_slices, mode, model)
  File "/Users/jurica/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/gensim/models/wrappers/dtmmodel.py", line 192, in train
    p = Popen([self.dtm_path] + arguments.split(), stdout=PIPE, stderr=PIPE)
  File "/Applications/Canopy.app/appdata/canopy-1.5.5.3123.macosx-x86_64/Canopy.app/Contents/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/Applications/Canopy.app/appdata/canopy-1.5.5.3123.macosx-x86_64/Canopy.app/Contents/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied

Not sure why I am getting a permission denied message when running with sudo.

3) when I invoke the binary directly everything works just fine  (I am using the dat files downloaded with the binaries to test it). 
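For the first error in the list above (the prefix folder is not created automatically), one possible workaround is to create the folder before constructing the model. The following is a minimal sketch under some assumptions: the helper name ensure_prefix is hypothetical, and os.makedirs(..., exist_ok=True) needs Python 3.2 or newer, whereas the tracebacks in this thread come from Python 2.7, where the existence check would have to be written out by hand.

```python
import os


def ensure_prefix(base_dir, name='dtmPrefix'):
    """Create base_dir/name if it is missing and return it as an absolute
    path ending in a path separator, so file names can be appended directly."""
    folder = os.path.abspath(os.path.join(base_dir, name))
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, '')     # joining with '' appends the separator
```

The returned string can then be passed as the prefix argument. Building it with os.path.join also avoids hard-coding '/' or '\\' for one particular operating system.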


I am sorry to bother you again with this but any ideas what could be wrong would be beneficial. Artyom, could you upload your binaries as well to the example you created? 

Best,
Jurica

Lev Konstantinovskiy

Oct 5, 2015, 12:41:56 PM
to gensim
Hi Jurica,

Could you please help me reproduce this issue? Maybe post more of the code? Maybe also the output of the "ls -ltr" command, to check permissions on the folders?

The code below runs on my Fedora Linux. It works both when the prefix exists and when it doesn't.
I had to compile the DTM binary myself, as the downloaded ones didn't work.

import os

# DtmModel, dtm_path, corpus and time_seq are defined earlier in the script (not shown)
dtm_home = os.environ.get('DTM_HOME', "/home/lev/dtm/")
prefix = os.path.dirname(dtm_home) + '\\dtmPrefix_new\\'  # OR prefix = os.path.join(dtm_home, 'dtmPrefix_new')
model = DtmModel(dtm_path, corpus, time_seq, num_topics=2, id2word=corpus.dictionary, initialize_lda=True, prefix=prefix)

Artyom Topchyan

Oct 5, 2015, 7:23:46 PM
to gensim

I googled this, and it may be related to the permissions on the binary. Do you have exec rights on it?
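One way to check this from Python before handing the path to the wrapper is sketched below. The helper name check_dtm_binary is made up for illustration; it also covers the case where the path points to a folder rather than a file, since trying to execute a folder fails with a permission error too.

```python
import os


def check_dtm_binary(dtm_path):
    """Return a short description of the problem, or None if the path
    looks like an executable file."""
    if not os.path.exists(dtm_path):
        return 'path does not exist'
    if os.path.isdir(dtm_path):
        return 'path is a folder, not the binary itself'
    if not os.access(dtm_path, os.X_OK):
        return 'file is not executable (try chmod +x on it)'
    return None
```

Note that os.access is only a hint: on Windows the execute check is not meaningful, and on any system the file could still fail to run for other reasons, for example if it was built for a different platform.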


Tianran Hu

Oct 5, 2015, 9:59:42 PM
to gensim
Hi Artyom,

I ran the example you posted, same IOError appears:

IOError: [Errno 2] No such file or directory: '/tmp/49f79a_train_out/em_log.dat'

And in the example, by "model = DtmModel(dtm_path...)" do you actually mean "model = dtmmodel.DtmModel(dtm_path...)" ? Since you didn't import DtmModel. 

Thanks!

Lev Konstantinovskiy

Oct 6, 2015, 12:11:37 AM
to gensim
Hi Tianran,

Thank you for trying this code. It would be good to get to the bottom of this. The typo you highlighted is fixed in an updated version of the tutorial, a merge candidate at https://github.com/piskvorky/gensim/blob/Arttii-develop/docs/notebooks/dtm_example.ipynb 
Does it work for you?

And about the IOError: are you running with initialize_lda=True? That is the workaround for the IOError on em_log.dat.

Tianran Hu

Oct 6, 2015, 1:03:58 PM
to gensim
Hi Lev,

Thanks for your reply.

-- Since I just updated gensim, I need to import DtmModel as "from gensim.models.dtmmodel import DtmModel". "from gensim.models.dtmmodel import DtmModel" doesn't work for me.

-- I set initialize_lda=True and still get the IOError.

Thanks!

Lev Konstantinovskiy

Oct 8, 2015, 9:23:31 AM
to gensim

1) Thanks for pointing out the renaming error in the tutorial. It is now corrected.


2) Does your dtm_path point to the binary or to the folder? I had this error when pointing to the folder. The expected value is the path to the DTM binary itself.


3) About the "initialize_lda=False" mode: when I tried running the binary directly with "initialize_lda=False", I got "Error opening file /tmp/a65419_train_out/initial-lda-ss.dat. Failing." This is because DTM is trying to load an existing LDA model from that file. When the file is provided, it runs correctly. The same error happens when the command is run from gensim.


In PR #476 I added error forwarding via (backported) subprocess.check_output to DTM and Mallet wrappers so this will be more obvious in the future.   
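To illustrate the general idea of error forwarding with subprocess.check_output, here is a minimal sketch. It is not the code from the PR, which is not shown in this thread; the helper name run_and_forward is hypothetical, and the sketch targets Python 3, where check_output is built in.

```python
import subprocess
import sys


def run_and_forward(cmd):
    """Run `cmd` (a list of arguments) and return its combined stdout/stderr
    as bytes. If the command exits with a non-zero status, raise a
    RuntimeError that carries whatever the command printed."""
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as err:
        output = err.output.decode('utf-8', 'replace')
        raise RuntimeError(
            'command failed with exit code %d:\n%s' % (err.returncode, output))


# Stand-in for a failing binary: a tiny Python child process that prints
# an error message to stderr and exits with a non-zero status.
failing_cmd = [sys.executable, '-c',
               "import sys; sys.stderr.write('Failing.'); sys.exit(1)"]
```

Calling run_and_forward(failing_cmd) raises a RuntimeError whose message contains the child's own error text, so a message like the one in point 3 would show up directly instead of a later IOError about a missing output file.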
