python 3 port


Radim Řehůřek

Apr 20, 2014, 12:09:00 PM
to gen...@googlegroups.com, Stephanus van Schalkwyk, er...@vantill.com
Hi all,

Yesterday I finished the last pieces of the Python 3 port of gensim. All tests now pass on Python 2.6, 2.7 and 3.3.

The next release of gensim will therefore be the first "py3k-compatible" release!

Big thanks to Lars for initiating the process and helping out.

Steph, Erik: could you please run all unit tests on a Windows machine and report the result?

The code is in a separate git branch called "py3k" for now, at https://github.com/piskvorky/gensim/pull/196 . Let me know if you need any assistance with the git shenanigans or testing.

Best,
Radim

Radim Řehůřek

Apr 20, 2014, 12:19:01 PM
to gen...@googlegroups.com, Parikshit Samant
I forgot to say: for the curious, the port uses the "six" helper Python library [1] to achieve dual compatibility, i.e. both Python 2.6+ and Python 3.3+ run from the same code base.

This is a different approach from Parikshit's gensim port [2] of some months ago, which created a one-way, separate fork for the Python 3 code (and which is now woefully out of sync with the latest gensim).

Having a single, dual-compatible code base should cut down maintenance costs and keep both Pythons supported at the same level.
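
As a rough illustration of the single-code-base technique (not gensim's actual code), the sketch below hand-writes a few of the shims that six provides. With six installed, `six.PY3`, `six.string_types` and `six.iteritems` would replace them. The `token2id` mapping is a made-up stand-in.

```python
import sys

# six exposes the same check as six.PY3
PY3 = sys.version_info[0] >= 3

if PY3:
    string_types = (str,)            # what six.string_types is on Python 3

    def iteritems(d):
        # what six.iteritems does on Python 3
        return iter(d.items())
else:
    string_types = (basestring,)     # Python 2 only name

    def iteritems(d):
        return d.iteritems()

# hypothetical token -> id mapping, for illustration only
token2id = {"human": 0, "interface": 1}

# the same line runs unchanged on Python 2 and Python 3
id2token = dict((v, k) for k, v in iteritems(token2id))
```

The point of the design is that version checks live in one small compatibility layer, while the rest of the code is written once.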

This is a fairly major change, so I'll be bumping the main version to 0.10.0 for the next release, to remind everyone to check carefully, especially before upgrading on production systems.

Best,
Radim

Christopher Corley

Apr 20, 2014, 5:18:53 PM
to gensim
Great! Gensim being Py2 only is one of the few things holding me back
from doing all of my thesis work in Py3. Thanks all, and good work.

Chris.


Steph van Schalkwyk

Apr 21, 2014, 12:12:21 AM
to Radim Řehůřek, gen...@googlegroups.com, er...@vantill.com
On Windows, Anaconda x64 3.3:
[py3k] C:\gensim-py3k>python setup.py test
running test
running egg_info
creating gensim.egg-info
writing requirements to gensim.egg-info\requires.txt
writing gensim.egg-info\PKG-INFO
writing dependency_links to gensim.egg-info\dependency_links.txt
writing top-level names to gensim.egg-info\top_level.txt
writing manifest file 'gensim.egg-info\SOURCES.txt'
reading manifest file 'gensim.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.sh' under directory '.'
writing manifest file 'gensim.egg-info\SOURCES.txt'
running build_ext
test_load (gensim.test.test_corpora.TestBleiCorpus) ... ok
test_save (gensim.test.test_corpora.TestBleiCorpus) ... ok
test_serialize (gensim.test.test_corpora.TestBleiCorpus) ... ok
test_serialize_compressed (gensim.test.test_corpora.TestBleiCorpus) ... ok
test_load (gensim.test.test_corpora.TestLowCorpus) ... ok
test_save (gensim.test.test_corpora.TestLowCorpus) ... ok
test_serialize (gensim.test.test_corpora.TestLowCorpus) ... ok
test_serialize_compressed (gensim.test.test_corpora.TestLowCorpus) ... ok
test_load (gensim.test.test_corpora.TestMalletCorpus) ... ok
test_load_with_metadata (gensim.test.test_corpora.TestMalletCorpus) ... ok
test_save (gensim.test.test_corpora.TestMalletCorpus) ... ok
test_serialize (gensim.test.test_corpora.TestMalletCorpus) ... ok
test_serialize_compressed (gensim.test.test_corpora.TestMalletCorpus) ... ok
test_load (gensim.test.test_corpora.TestMmCorpus) ... ok
test_save (gensim.test.test_corpora.TestMmCorpus) ... ok
test_serialize (gensim.test.test_corpora.TestMmCorpus) ... c:\gensim-py3k\gensim\corpora\indexedcorpus.py:118: ResourceWarning: unclosed file <_io.TextIOWrapper name='c:\\users\\svansc~1\\appdata\\local\\temp\\2\\gensim_corpus.tst' mode='r' encoding='cp1252'>
  return self.docbyoffset(self.index[docno])
ok
test_serialize_compressed (gensim.test.test_corpora.TestMmCorpus) ... ok
test_load (gensim.test.test_corpora.TestSvmLightCorpus) ... ok
test_save (gensim.test.test_corpora.TestSvmLightCorpus) ... ok
test_serialize (gensim.test.test_corpora.TestSvmLightCorpus) ... ok
test_serialize_compressed (gensim.test.test_corpora.TestSvmLightCorpus) ... ok
test_load (gensim.test.test_corpora.TestUciCorpus) ... ok
test_save (gensim.test.test_corpora.TestUciCorpus) ... ok
test_serialize (gensim.test.test_corpora.TestUciCorpus) ... ok
test_serialize_compressed (gensim.test.test_corpora.TestUciCorpus) ... ok
testBuild (gensim.test.test_corpora_dictionary.TestDictionary) ... ok
testDocFreqAndToken2IdForSeveralDocsWithOneWord (gensim.test.test_corpora_dictionary.TestDictionary) ... ok
testDocFreqForOneDocWithSeveralWord (gensim.test.test_corpora_dictionary.TestDictionary) ... ok
testDocFreqOneDoc (gensim.test.test_corpora_dictionary.TestDictionary) ... ok
testFilter (gensim.test.test_corpora_dictionary.TestDictionary) ... ok
test_dict_interface (gensim.test.test_corpora_dictionary.TestDictionary)
Test Python 2 dict-like interface in both Python 2 and 3. ... ok
test_doc2bow (gensim.test.test_corpora_dictionary.TestDictionary) ... ok
test_from_corpus (gensim.test.test_corpora_dictionary.TestDictionary)
build `Dictionary` from an existing corpus ... ok
test_saveAsText_and_loadFromText (gensim.test.test_corpora_dictionary.TestDictionary)
`Dictionary` can be saved as textfile and loaded again from textfile. ... ok
testBuild (gensim.test.test_corpora_hashdictionary.TestHashDictionary) ... ok
testDebugMode (gensim.test.test_corpora_hashdictionary.TestHashDictionary) ... ok
testDocFreqAndToken2IdForSeveralDocsWithOneWord (gensim.test.test_corpora_hashdictionary.TestHashDictionary) ... ok
testDocFreqForOneDocWithSeveralWord (gensim.test.test_corpora_hashdictionary.TestHashDictionary) ... ok
testDocFreqOneDoc (gensim.test.test_corpora_hashdictionary.TestHashDictionary) ... ok
testFilter (gensim.test.test_corpora_hashdictionary.TestHashDictionary) ... ok
testRange (gensim.test.test_corpora_hashdictionary.TestHashDictionary) ... ok
test_saveAsText (gensim.test.test_corpora_hashdictionary.TestHashDictionary)
`HashDictionary` can be saved as textfile. ... ERROR
test_saveAsTextBz2 (gensim.test.test_corpora_hashdictionary.TestHashDictionary)
`HashDictionary` can be saved & loaded as compressed pickle. ... ok
test_corpus (gensim.test.test_lee.TestLeeTest)
availability and integrity of corpus ... ok
test_lee (gensim.test.test_lee.TestLeeTest)
correlation with human data > 0.6 ... ok
test_miislita_high_level (gensim.test.test_miislita.TestMiislita) ... ok
test_save_load_ability (gensim.test.test_miislita.TestMiislita) ... c:\gensim-py3k\gensim\interfaces.py:60: UserWarning: corpus.save() stores only the (tiny) iteration object; to serialize the actual corpus content, use e.g. MmCorpus.serialize(corpus)
  warnings.warn("corpus.save() stores only the (tiny) iteration object; "
ok
test_textcorpus (gensim.test.test_miislita.TestMiislita)
Make sure TextCorpus can be serialized to disk. ... ok
testLargeMmap (gensim.test.test_models.TestLdaMallet) ... ok
testPersistence (gensim.test.test_models.TestLdaMallet) ... ok
testTransform (gensim.test.test_models.TestLdaMallet) ... ok
testLargeMmap (gensim.test.test_models.TestLdaModel) ... WARNING:gensim.models.ldamodel:no word id mapping provided; initializing from corpus, assuming identity
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
c:\gensim-py3k\gensim\models\ldamodel.py:636: DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  score += numpy.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, id]) for id, cnt in doc)
ok
testPersistence (gensim.test.test_models.TestLdaModel) ... WARNING:gensim.models.ldamodel:no word id mapping provided; initializing from corpus, assuming identity
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
ok
testTransform (gensim.test.test_models.TestLdaModel) ... ok
testPersistence (gensim.test.test_models.TestLogEntropyModel) ... ok
testTransform (gensim.test.test_models.TestLogEntropyModel) ... ok
testCorpusTransform (gensim.test.test_models.TestLsiModel)
Test lsi[corpus] transformation. ... WARNING:gensim.models.lsimodel:no word id mapping provided; initializing from corpus, assuming identity
ok
testLargeMmap (gensim.test.test_models.TestLsiModel) ... WARNING:gensim.models.lsimodel:no word id mapping provided; initializing from corpus, assuming identity
C:\Anaconda\envs\py3k\lib\site-packages\scipy\sparse\compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
ok
testOnlineTransform (gensim.test.test_models.TestLsiModel) ... WARNING:gensim.models.lsimodel:no word id mapping provided; initializing from corpus, assuming identity
ok
testPersistence (gensim.test.test_models.TestLsiModel) ... WARNING:gensim.models.lsimodel:no word id mapping provided; initializing from corpus, assuming identity
ok
testTransform (gensim.test.test_models.TestLsiModel)
Test lsi[vector] transformation. ... WARNING:gensim.models.lsimodel:no word id mapping provided; initializing from corpus, assuming identity
ok
testPersistence (gensim.test.test_models.TestRpModel) ... ok
testTransform (gensim.test.test_models.TestRpModel) ... ok
testInit (gensim.test.test_models.TestTfidfModel) ... ok
testPersistence (gensim.test.test_models.TestTfidfModel) ... ok
testTransform (gensim.test.test_models.TestTfidfModel) ... ok
testSplitAlphanum (gensim.test.test_parsing.TestPreprocessing) ... ok
testStemText (gensim.test.test_parsing.TestPreprocessing) ... ok
testStripMultipleWhitespaces (gensim.test.test_parsing.TestPreprocessing) ... ok
testStripNonAlphanum (gensim.test.test_parsing.TestPreprocessing) ... ok
testStripNumeric (gensim.test.test_parsing.TestPreprocessing) ... ok
testStripShort (gensim.test.test_parsing.TestPreprocessing) ... ok
testStripStopwords (gensim.test.test_parsing.TestPreprocessing) ... ok
testStripTags (gensim.test.test_parsing.TestPreprocessing) ... ok
testChunking (gensim.test.test_similarities.TestMatrixSimilarity) ... ok
testFull (gensim.test.test_similarities.TestMatrixSimilarity) ... ok
testIter (gensim.test.test_similarities.TestMatrixSimilarity) ... ok
testLarge (gensim.test.test_similarities.TestMatrixSimilarity) ... ok
testMmap (gensim.test.test_similarities.TestMatrixSimilarity) ... ok
testNumBest (gensim.test.test_similarities.TestMatrixSimilarity) ... ok
testPersistency (gensim.test.test_similarities.TestMatrixSimilarity) ... ok
testChunking (gensim.test.test_similarities.TestSimilarity) ... ok
testFull (gensim.test.test_similarities.TestSimilarity) ... ok
testIter (gensim.test.test_similarities.TestSimilarity) ... ok
testLarge (gensim.test.test_similarities.TestSimilarity) ... ok
testMmap (gensim.test.test_similarities.TestSimilarity) ... ok
testNumBest (gensim.test.test_similarities.TestSimilarity) ... ok
testPersistency (gensim.test.test_similarities.TestSimilarity) ... ok
testReopen (gensim.test.test_similarities.TestSimilarity)
test re-opening partially full shards ... ok
testSharding (gensim.test.test_similarities.TestSimilarity) ... ok
testChunking (gensim.test.test_similarities.TestSparseMatrixSimilarity) ... ok
testFull (gensim.test.test_similarities.TestSparseMatrixSimilarity) ... ok
testIter (gensim.test.test_similarities.TestSparseMatrixSimilarity) ... ok
testLarge (gensim.test.test_similarities.TestSparseMatrixSimilarity) ... ok
testMmap (gensim.test.test_similarities.TestSparseMatrixSimilarity) ... ok
testNumBest (gensim.test.test_similarities.TestSparseMatrixSimilarity) ... ok
testPersistency (gensim.test.test_similarities.TestSparseMatrixSimilarity) ... ok
test_None (gensim.test.test_utils.TestIsCorpus) ... ok
test_int_tuples (gensim.test.test_utils.TestIsCorpus) ... ok
test_invalid_formats (gensim.test.test_utils.TestIsCorpus) ... ok
test_simple_lists_of_tuples (gensim.test.test_utils.TestIsCorpus) ... ok
testLargeMmap (gensim.test.test_word2vec.TestWord2VecModel)
Test storing/loading the entire model. ... c:\gensim-py3k\gensim\models\word2vec.py:303: UserWarning: Cython compilation failed, training will be slow. Do you have Cython installed? `pip install cython`
  warnings.warn("Cython compilation failed, training will be slow. Do you have Cython installed? `pip install cython`")
ok
testParallel (gensim.test.test_word2vec.TestWord2VecModel)
Test word2vec parallel training. ... ok
testPersistence (gensim.test.test_word2vec.TestWord2VecModel)
Test storing/loading the entire model. ... ok
testPersistenceWord2VecFormat (gensim.test.test_word2vec.TestWord2VecModel)
Test storing/loading the entire model in word2vec format. ... ok
testPersistenceWord2VecFormatWithVocab (gensim.test.test_word2vec.TestWord2VecModel)
Test storing/loading the entire model and vocabulary in word2vec format. ... ok
testRNG (gensim.test.test_word2vec.TestWord2VecModel)
Test word2vec results identical with identical RNG seed. ... ok
testTraining (gensim.test.test_word2vec.TestWord2VecModel)
Test word2vec training. ... WARNING:gensim.models.word2vec:consider setting layer size to a multiple of 4 for greater performance
WARNING:gensim.models.word2vec:consider setting layer size to a multiple of 4 for greater performance
ok
testTrainingCbow (gensim.test.test_word2vec.TestWord2VecModel)
Test CBOW word2vec training. ... WARNING:gensim.models.word2vec:consider setting layer size to a multiple of 4 for greater performance
WARNING:gensim.models.word2vec:consider setting layer size to a multiple of 4 for greater performance
ok
testVocab (gensim.test.test_word2vec.TestWord2VecModel)
Test word2vec vocabulary building. ... ok
testLineSentenceWorksWithCompressedFile (gensim.test.test_word2vec.TestWord2VecSentenceIterators)
Does LineSentence work with a compressed file object argument? ... ok
testLineSentenceWorksWithFilename (gensim.test.test_word2vec.TestWord2VecSentenceIterators)
Does LineSentence work with a filename argument? ... ok
testLineSentenceWorksWithNormalFile (gensim.test.test_word2vec.TestWord2VecSentenceIterators)
Does LineSentence work with a file object argument, rather than filename? ... ok

======================================================================
ERROR: test_saveAsText (gensim.test.test_corpora_hashdictionary.TestHashDictionary)
`HashDictionary` can be saved as textfile.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\gensim-py3k\gensim\test\test_corpora_hashdictionary.py", line 167, in test_saveAsText
    d.save_as_text(tmpf)
  File "c:\gensim-py3k\gensim\corpora\hashdictionary.py", line 231, in save_as_text
    fout.write("%i\t%i\t%s\n" % (tokenid, self.dfs.get(tokenid, 0), '\t'.join(words_df)))
  File "C:\Anaconda\envs\py3k\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0165' in position 10: character maps to <undefined>

----------------------------------------------------------------------
Ran 113 tests in 15.680s

FAILED (errors=1)

[py3k] C:\gensim-py3k>
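
The `ResourceWarning: unclosed file` in the log above is what Python 3 reports when a file object is left for the garbage collector to close. The sketch below shows the usual pattern for avoiding that, under the assumption that the reader can close the file when it is done. The function names and the temporary file are hypothetical and are not taken from gensim's `indexedcorpus.py`.

```python
import os
import tempfile

# hypothetical stand-in for the temporary corpus file used by the test
path = os.path.join(tempfile.mkdtemp(), "corpus.tst")
with open(path, "w") as fout:
    fout.write("1 1 1.0\n")


def read_leaky(fname):
    # The file object is never closed explicitly; Python 3 may emit a
    # ResourceWarning when it is eventually garbage-collected.
    return open(fname).read()


def read_closed(fname):
    # The context manager closes the file as soon as the block exits.
    with open(fname) as fin:
        return fin.read()


content = read_closed(path)
```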

Radim Řehůřek

Apr 21, 2014, 4:59:09 AM
to gen...@googlegroups.com, Radim Řehůřek, er...@vantill.com, Stephanus van Schalkwyk
On Monday, April 21, 2014 6:12:21 AM UTC+2, Stephanus van Schalkwyk wrote:
======================================================================
ERROR: test_saveAsText (gensim.test.test_corpora_hashdictionary.TestHashDictionary)
`HashDictionary` can be saved as textfile.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\gensim-py3k\gensim\test\test_corpora_hashdictionary.py", line 167, in test_saveAsText
    d.save_as_text(tmpf)
  File "c:\gensim-py3k\gensim\corpora\hashdictionary.py", line 231, in save_as_text
    fout.write("%i\t%i\t%s\n" % (tokenid, self.dfs.get(tokenid, 0), '\t'.join(words_df)))
  File "C:\Anaconda\envs\py3k\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0165' in position 10: character maps to <undefined>

----------------------------------------------------------------------
Ran 113 tests in 15.680s

FAILED (errors=1)

Fixed!

Can you please try again, Steph?

Cheers,
Radim
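
The thread does not say what the fix was. As a hedged sketch of one common way to avoid this class of error: the traceback shows text being written through the platform default encoding (cp1252 on that Windows machine), which has no mapping for '\u0165'. Opening the file with an explicit UTF-8 encoding sidesteps the default. The file name and the line contents below are made up for illustration.

```python
import io
import os
import tempfile

token = u"\u0165"  # the character named in the traceback

# cp1252 has no mapping for this character, so encoding to it fails
try:
    token.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# hypothetical output file; an explicit encoding makes the result
# independent of the platform default on both Python 2 and Python 3
path = os.path.join(tempfile.mkdtemp(), "dictionary.txt")
line = u"%i\t%i\t%s\n" % (1, 2, token)
with io.open(path, "w", encoding="utf-8") as fout:
    fout.write(line)

with io.open(path, "r", encoding="utf-8") as fin:
    round_trip = fin.read()
```

Writing pre-encoded bytes to a file opened in binary mode would be another way to get the same independence from the default encoding.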

Valerio Maggio

Apr 21, 2014, 5:20:07 AM
to gen...@googlegroups.com, Radim Řehůřek, er...@vantill.com, Stephanus van Schalkwyk
Hi Radim, 
I've just checked out the latest py3k branch and ran all 113 tests: every test passed and everything worked like a charm !-)

Please find the detailed report attached.

Please note that I used Python 3.4, installed via Homebrew on Mac OS X 10.9 (Mavericks).

Best,
Valerio


py3_test_report_dump.txt

Valerio Maggio

Apr 21, 2014, 5:32:28 AM
to gen...@googlegroups.com, Radim Řehůřek, er...@vantill.com, Stephanus van Schalkwyk
Hi Radim, 
After all the tests passed, I also tried to install the package via setup.py, namely `python3 setup.py install`.

The package seemed to install, but I got the following error:

Processing gensim-0.9.1-py3.4.egg
creating /usr/local/lib/python3.4/site-packages/gensim-0.9.1-py3.4.egg
Extracting gensim-0.9.1-py3.4.egg to /usr/local/lib/python3.4/site-packages
  File "/usr/local/lib/python3.4/site-packages/gensim-0.9.1-py3.4.egg/gensim/models/lsi_dispatcher.py", line 186
    print globals()["__doc__"] % locals()
                ^
SyntaxError: invalid syntax

I then made a small fix for this error and reinstalled `gensim` with no problem !-) (Please find the diff snapshot attached.)

I think this "fix" doesn't justify a pull request... what do you think? :)
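
The diff itself is only in the attached screenshot, so the following is a guess at the shape of such a fix rather than the actual change. The `SyntaxError` comes from the Python 2 `print` statement, which Python 3 removed in favour of the `print()` function. With a single argument, the parenthesised form runs on both versions. The `USAGE` string and `usage_text` helper are hypothetical stand-ins for the module docstring formatting at line 186 of `lsi_dispatcher.py`.

```python
import os
import sys

# hypothetical usage template, standing in for the module docstring
USAGE = "USAGE: %(program)s SIZE_OF_JOBS_QUEUE"


def usage_text(argv0):
    program = os.path.basename(argv0)
    # %-formatting with a dict only uses the keys the template names
    return USAGE % locals()


# Python 2 only:      print USAGE % locals()
# Python 2 and 3:     print(...) with a single argument
print(usage_text(sys.argv[0]))
```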

HTH

Best,
Valerio


Screen Shot 2014-04-21 at 11.29.18.png

Radim Řehůřek

Apr 21, 2014, 6:02:46 AM
to gen...@googlegroups.com
Excellent, thanks Valerio.

Fixed & pushed to github.

Also thanks for testing on 3.4 -- I'll add that version to CI tests as well (we're aiming at supporting all >=3.3).

Best,
Radim

Valerio Maggio

Apr 21, 2014, 6:48:56 AM
to gen...@googlegroups.com



On Mon, Apr 21, 2014 at 12:02 PM, Radim Řehůřek wrote:
Excellent, thanks Valerio.


You're more than welcome! :)
 
Fixed & pushed to github.

Got it! Thanks!
 

Also thanks for testing on 3.4 -- I'll add that version to CI tests as well (we're aiming at supporting all >=3.3).

And that's very good news!
 
All the Best,
Valerio

Radim Řehůřek

Apr 21, 2014, 7:22:16 AM
to gen...@googlegroups.com
Heh, added 3.4 auto-check but CI failed, because Travis doesn't even support 3.4 yet: https://github.com/travis-ci/travis-ci/issues/1989

We're getting too cutting edge here :)

-rr

Valerio Maggio

Apr 21, 2014, 12:24:06 PM
to gen...@googlegroups.com
On Mon, Apr 21, 2014 at 1:22 PM, Radim Řehůřek <m...@radimrehurek.com> wrote:
Heh, added 3.4 auto-check but CI failed, because Travis doesn't even support 3.4 yet: https://github.com/travis-ci/travis-ci/issues/1989

Uuh..too bad... 
Anyway, it seems that they're still working on it and that Python 3.4 support will be added soon.

I'll try to keep you posted on this !-)
 
We're getting too cutting edge here :)

:D
gr8!

best,
vm

Steph van Schalkwyk

Apr 21, 2014, 8:49:08 PM
to Radim Řehůřek, gen...@googlegroups.com, er...@vantill.com
Latest:

[py3k] C:\gensim-py3k>python setup.py test
running test
running egg_info
creating gensim.egg-info
writing requirements to gensim.egg-info\requires.txt
writing top-level names to gensim.egg-info\top_level.txt
writing dependency_links to gensim.egg-info\dependency_links.txt
writing gensim.egg-info\PKG-INFO
`HashDictionary` can be saved as textfile. ... ok
----------------------------------------------------------------------
Ran 113 tests in 14.964s

OK

[py3k] C:\gensim-py3k>




Valerio Maggio

Apr 28, 2014, 12:14:35 PM
to gen...@googlegroups.com


On Monday, April 21, 2014 6:24:06 PM UTC+2, Valerio Maggio wrote:



On Mon, Apr 21, 2014 at 1:22 PM, Radim Řehůřek wrote:
Heh, added 3.4 auto-check but CI failed, because Travis doesn't even support 3.4 yet: https://github.com/travis-ci/travis-ci/issues/1989

Uuh..too bad... 
Anyway, it seems that they're still working on it and that Python 3.4 support will be added soon.

I'll try to keep you posted on this !-)
