convert wikipedia dump to text using python -m gensim.scripts.make_wiki


Armin Tabari

Apr 6, 2016, 5:34:43 PM
to gensim

I want to use gensim to convert a Wikipedia dump to plain text with the "python -m gensim.scripts.make_wiki" script.

It does not create any output!

I run it as:

python -m gensim.scripts.make_wikicorpus '/home/armin/workspace/wikipediadata/enwiki-latest-pages-articles.xml.bz2' '/home/armin/workspace/wikipediadata/results'

Does anybody know what is going on?

Lev Konstantinovskiy

Apr 7, 2016, 2:43:48 PM
to gensim
Could it be running out of RAM like in this question?

Armin Tabari

Apr 7, 2016, 3:26:35 PM
to gensim
No, I have 128GB of RAM :)

The error is this:

2016-04-06 20:43:46,471 : INFO : storing corpus in Matrix Market format to ./results/_bow.mm
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/scripts/make_wiki.py", line 88, in <module>
    MmCorpus.serialize(outp + '_bow.mm', wiki, progress_cnt=10000) # another ~9h
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/corpora/indexedcorpus.py", line 89, in serialize
    offsets = serializer.save_corpus(fname, corpus, id2word, progress_cnt=progress_cnt, metadata=metadata)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/corpora/mmcorpus.py", line 49, in save_corpus
    return matutils.MmWriter.write_corpus(fname, corpus, num_terms=num_terms, index=True, progress_cnt=progress_cnt, metadata=metadata)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/matutils.py", line 486, in write_corpus
    mw = MmWriter(fname)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/matutils.py", line 436, in __init__
    self.fout = utils.smart_open(self.fname, 'wb+') # open for both reading and writing
  File "build/bdist.linux-x86_64/egg/smart_open/smart_open_lib.py", line 111, in smart_open
NotImplementedError: unknown file mode wb+
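
For reference, the mode string in the last frame, 'wb+', is an ordinary mode for Python's built-in open(): truncate the file and open it in binary mode for both writing and reading. Below is a minimal sketch using only the built-in open() (not smart_open) of why a writer might want that mode. It assumes, without checking the gensim source, that MmWriter goes back and fills in a header once the totals are known. The file name and header strings are made up for illustration.

```python
import os
import tempfile


def write_then_patch_header(path):
    """Write a placeholder header and a body, then overwrite the header in place."""
    # 'wb+' truncates the file and opens it for binary writing AND reading.
    with open(path, "wb+") as fout:
        fout.write(b"HEADER:????\n")   # placeholder; the real value is not known yet
        fout.write(b"body line\n")
        fout.seek(0)                   # jump back to the start of the file
        fout.write(b"HEADER:0001\n")   # same length, so the body is left untouched
        fout.seek(0)
        return fout.read()             # reading works because of the '+'


# Hypothetical file name, written to a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    content = write_then_patch_header(os.path.join(tmp_dir, "demo.mm"))

print(content)
```

A file-opening wrapper that does not recognise 'wb+' cannot offer this write-then-rewind pattern, which matches the NotImplementedError above.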

Radim Řehůřek

Apr 8, 2016, 3:48:00 AM
to gensim
IIRC this is caused by using an old version of smart_open. Try upgrading it (and possibly gensim too, in case you have some old version).

HTH,
Radim
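
Before upgrading, it can help to see which versions are actually installed. Here is a minimal sketch, assuming Python 3.8+ (for importlib.metadata; the traceback above is from Python 2.7, where this module is not available) and assuming the packages are registered under the distribution names "gensim" and "smart_open". The helper name installed_version is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names are an assumption; adjust them if yours differ.
versions = {name: installed_version(name) for name in ("gensim", "smart_open")}

for name, found in versions.items():
    print(name, found if found is not None else "not installed")
```

If either version looks old, something like `pip install --upgrade smart_open gensim` is the usual way to upgrade, run in the same environment that runs the script.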

Lev Konstantinovskiy

Apr 11, 2016, 6:50:56 AM
to gensim
Hi Armin,

Did a smart_open upgrade fix your problem?

Armin Tabari

Apr 11, 2016, 10:26:31 AM
to gen...@googlegroups.com

Thank you, yes, the problem was with smart_open.

