Using Simple English wikipedia for benchmarking


Tom O'Hara

Jun 25, 2014, 5:27:08 PM
to gen...@googlegroups.com
Hi, can you provide some benchmark figures over a recent copy of Simple English Wikipedia (http://simple.wikipedia.org)? I'm having trouble processing the recent version of the full Wikipedia (April 2014), and I'm not sure if it's due just to the increased size or to the BLAS support (or both). For instance, even with just 100 topics, LSA is taking over a day to process the entire corpus when using OpenBLAS.
 
In addition, if you can keep a copy of the Simple English wikipedia dump online somewhere, it would help others reproduce older benchmarks. The current Simple English dump is only 95 MB (see below) compared to 10 GB for full English Wikipedia. Unfortunately, the dumps under wikimedia.org are less than two years old; and I could not find a copy corresponding to your existing benchmarks (e.g., June 2010).
 
Tom
 
p.s., Thanks for making the Gensim toolkit available: the LSA and LDA support is very handy!
 
----------
 
 
2014-06-23 17:37:03 done Articles, templates, media/file descriptions, and primary meta-pages.
2014-06-23 17:37:01: simplewiki (ID 27962) 193436 pages (485.1|154276.7/sec all|curr), 193436 revs (485.1|347.7/sec all|curr), 97.8%|98.0% prefetched (all|curr), ETA 2014-06-23 20:16:25 [max 4833303]

Radim Řehůřek

Jun 26, 2014, 3:07:44 PM
to gen...@googlegroups.com
Hello Tom,


On Wednesday, June 25, 2014 11:27:08 PM UTC+2, Tom O'Hara wrote:
Hi, can you provide some benchmark figures over a recent copy of Simple English Wikipedia (http://simple.wikipedia.org)? I'm having trouble processing the recent version of the full Wikipedia (April 2014), and I'm not sure if it's due just to the increased size or to the BLAS support (or both). For instance, even with just 100 topics, LSA is taking over a day to process the entire corpus when using OpenBLAS.

can you post the training log (at DEBUG level)?

Also, what's the output of `numpy.show_config()` and `scipy.show_config()`, just to make sure your OpenBLAS was picked up correctly.
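For anyone following along, a quick sanity check is that both libraries report the same BLAS; a minimal sketch (spotting 'openblas' in the printed output is a manual step):

```python
# Check which BLAS/LAPACK NumPy and SciPy were built against.
# Both should list 'openblas' under blas_opt_info / lapack_opt_info;
# SciPy links its own BLAS, so rebuilding NumPy alone is not enough.
import numpy
import scipy

numpy.show_config()
scipy.show_config()
```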

Best,
Radim

Tom O'Hara

Jun 26, 2014, 5:22:45 PM
to gen...@googlegroups.com
Thanks. In the course of producing the diagnostics, I noticed that SciPy might not have been configured properly, just NumPy: I had reinstalled NumPy from a local build configured with OpenBLAS, but I didn't reinstall SciPy. The before-and-after configurations are shown below. After rebuilding SciPy as well, LSA over Simple English Wikipedia (400 topics) now takes a third of the time.
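For anyone hitting the same thing: rebuilding SciPy against OpenBLAS typically means pointing its build at the library, e.g. via a `site.cfg` next to SciPy's `setup.py`. A minimal sketch, assuming OpenBLAS lives under /usr/local/openblas as in the config dumps below (the `include_dirs` path is an assumption):

```
[openblas]
libraries = openblas
library_dirs = /usr/local/openblas/lib
include_dirs = /usr/local/openblas/include
```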

The debug trace over Simple English Wikipedia is attached (redux-simplewiki-20140410-pages-articles.gensim-prep.lsa400.log). I'll post the full trace over Wikipedia in a day or two.

You might want to revise the BLAS diagnostics notes in the distributed computing tutorial (http://radimrehurek.com/gensim/distributed.html) to use scipy.show_config().

Tom

----------

new NumPy and SciPy configs:


>>> numpy.show_config()
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/openblas/lib']
    language = f77
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/openblas/lib']
    language = f77
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/openblas/lib']
    language = f77
blas_mkl_info:
  NOT AVAILABLE

>>> scipy.show_config()
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/openblas/lib']
    language = f77
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/openblas/lib']
    language = f77
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/openblas/lib']
    language = f77
blas_mkl_info:
  NOT AVAILABLE


$ lscpu
Architecture:          x86_64
CPU op-mode(s):        64-bit
CPU(s):                2
Thread(s) per core:    1
Core(s) per socket:    1
CPU socket(s):         2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Stepping:              5
CPU MHz:               2266.746
Hypervisor vendor:     Xen
Virtualization type:   para
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K

old Scipy config:

>>> scipy.show_config()
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib64']
    language = f77
amd_info:
    libraries = ['amd']
    library_dirs = ['/usr/lib64']
    define_macros = [('SCIPY_AMD_H', None)]
    swig_opts = ['-I/usr/include/suitesparse']
    include_dirs = ['/usr/include/suitesparse']
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib64']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib64']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
  NOT AVAILABLE
umfpack_info:
    libraries = ['umfpack', 'amd']
    library_dirs = ['/usr/lib64']
    define_macros = [('SCIPY_UMFPACK_H', None), ('SCIPY_AMD_H', None)]
    swig_opts = ['-I/usr/include/suitesparse', '-I/usr/include/suitesparse']
    include_dirs = ['/usr/include/suitesparse']
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib64']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
redux-simplewiki-20140410-pages-articles.gensim-prep.lsa400.log

Radim Řehůřek

Jun 27, 2014, 4:24:13 AM
to gen...@googlegroups.com
Hello Tom,

ok, cool. I will also try the "simplewiki" corpus and post my timings.

In your log I see a scipy deprecation warning about sparsetools, which is scary. Must be some new scipy thing. Gensim depends on scipy's sparsetools heavily; I'll have to investigate.


You might want to revise the BLAS diagnostics notes in the distributed computing tutorial (http://radimrehurek.com/gensim/distributed.html) to use scipy.show_config().

Done, cheers!

Radim

Tom O'Hara

Jun 30, 2014, 6:47:48 PM
to gen...@googlegroups.com
Thanks.
 
Note that the full Wikipedia run seems to be getting stuck at a QR computation. See below. The log file is attached.
 
Tom
 
 
$ egrep 'processed.*[05]00000' redux-enwiki-20140402-pages-articles.gensim-prep.lsa100.log
2014-06-27 00:23:50,188 : INFO : processed documents up to #500000
2014-06-27 02:06:33,279 : INFO : processed documents up to #1000000
2014-06-27 03:31:37,249 : INFO : processed documents up to #1500000
2014-06-27 04:48:59,445 : INFO : processed documents up to #2000000
2014-06-27 06:06:06,657 : INFO : processed documents up to #2500000
 
$ egrep 'processed|QR' /c/temp/redux-enwiki-20140402-pages-articles.gensim-prep.lsa100.log | tail
2014-06-27 06:19:41,163 : DEBUG : computing QR of (100000, 200) dense matrix
2014-06-27 06:20:11,732 : DEBUG : computing QR of (100000, 200) dense matrix
2014-06-27 06:20:54,074 : DEBUG : computing QR of (100000, 100) dense matrix
2014-06-27 06:21:07,718 : INFO : processed documents up to #2600000
2014-06-27 06:22:18,697 : DEBUG : computing QR of (100000, 200) dense matrix
2014-06-27 06:22:45,132 : DEBUG : computing QR of (100000, 200) dense matrix
2014-06-27 06:23:12,589 : DEBUG : computing QR of (100000, 200) dense matrix
2014-06-27 06:23:48,110 : DEBUG : computing QR of (100000, 100) dense matrix
2014-06-27 06:24:02,651 : INFO : processed documents up to #2620000
2014-06-27 06:25:27,187 : DEBUG : computing QR of (100000, 200) dense matrix
$ tail -5 redux-enwiki-20140402-pages-articles.gensim-prep.lsa100.log
2014-06-27 06:25:17,304 : DEBUG : converting corpus to csc format
2014-06-27 06:25:19,761 : INFO : using 100 extra samples and 2 power iterations
2014-06-27 06:25:19,762 : INFO : 1st phase: constructing (100000, 200) action matrix
2014-06-27 06:25:25,673 : INFO : orthonormalizing (100000, 200) action matrix
2014-06-27 06:25:27,187 : DEBUG : computing QR of (100000, 200) dense matrix
redux-enwiki-20140402-pages-articles.gensim-prep.lsa100.log

Radim Řehůřek

Jul 1, 2014, 6:15:16 AM
to gen...@googlegroups.com
Tom, when you say "stuck", do you mean it stops completely (CPU at 0%, program hangs) or that the QRs take longer, relative to other parts?

Because yes, the partial QRs *are* the slowest part of the online SVD computation :)

If it hangs, it's not good though. Is your OpenBLAS using all available cores during these QRs? What's your CPU usage at?
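If the threads aren't being picked up, it may be worth pinning the thread count explicitly before launching the job; a minimal sketch (these are the standard OpenBLAS/OpenMP environment variables, and `run_lsa.py` is just a placeholder name for the training script):

```shell
# Force OpenBLAS to use all 4 cores. OPENBLAS_NUM_THREADS is OpenBLAS's
# own knob; OMP_NUM_THREADS applies when OpenBLAS was built with OpenMP.
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
# python run_lsa.py   # placeholder for the actual LSA training script
echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```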

Best,
Radim

Tom O'Hara

Jul 1, 2014, 4:29:30 PM
to gen...@googlegroups.com, Radim Řehůřek
It was hanging with the python process using 100% CPU: there was no update to the log for 3 days. It looks like BLAS is not utilizing all available CPUs. See below.

At least using OpenBLAS sped things up: it only took about 8 hours to get to the 2.6M-document point, compared to about 32 hours previously. With this version of Wikipedia (2 Apr 14), 3.5M documents get processed in total.

Tom

$ top
top - 22:34:22 up 181 days, 23:11, 12 users,  load average: 1.00, 1.00, 1.00
Tasks: 133 total,   2 running, 131 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.1%us,  1.3%sy,  2.0%ni, 90.9%id,  0.5%wa,  0.0%hi,  0.0%si,  2.1%st
Mem:  15749356k total, 11383560k used,  4365796k free,   141884k buffers
Swap:        0k total,        0k used,        0k free,  6833704k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                      
23935 root       5 -15 3662m 3.4g 6780 R   95 22.8   6339:29 python                        
  615 root       5 -15 19236 1300  952 R    2  0.0   0:00.01 top                           
    1 root      20   0 23824 1364  624 S    0  0.0   0:21.73 init                          

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        64-bit
CPU(s):                4

Thread(s) per core:    1
Core(s) per socket:    1
CPU socket(s):         4

Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              10
CPU MHz:               2659.998

Hypervisor vendor:     Xen
Virtualization type:   para
L1d cache:             32K
L1i cache:             32K
L2 cache:              6144K

Radim Řehůřek

Jul 8, 2014, 1:39:38 PM
to gen...@googlegroups.com, m...@radimrehurek.com
Hi Tom,

I finally had time to run and time LSI on the simplewiki corpus. Sorry for the delay!

1. Building the dictionary & corpus from the bzipped XML takes ~10 minutes.

The result is a 111,516 docs x 28,139 vocab TF-IDF matrix, with 6,284,690 nonzeros. Full log here: https://gist.github.com/piskvorky/eaa837b370b8543e8576

2. Building an LSI model of 200 topics from this matrix took less than 2 minutes:

>>> import logging, gensim, bz2
>>> id2word = gensim.corpora.Dictionary.load_from_text('simplewiki_en_wordids.txt.bz2')
>>> mm = gensim.corpora.MmCorpus('simplewiki_en_tfidf.mm')
>>> %time lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=200)

CPU times: user 2min 55s, sys: 6.28 s, total: 3min 1s
Wall time: 1min 45s 


I followed my own tutorial: http://radimrehurek.com/gensim/wiki.html . Computer: MacBook Pro 9,1 (i7 2.3GHz, 4 cores), Apple's preinstalled Accelerate framework for BLAS.

Let me know how things worked out for you, and how your numbers compare to mine. 3 days sounds really excessive when it takes <12 minutes on my laptop all put together! I suspect some serious error somewhere.

Best,
Radim

Tom O'Hara

Jul 10, 2014, 7:45:41 PM
to gen...@googlegroups.com, m...@radimrehurek.com
Thanks for posting those results over Simple English Wikipedia (simplewiki). Note that I am only having hangups with the full English Wikipedia (enwiki). It took me an hour to preprocess simplewiki and then about 8 minutes to run LSA with 200 topics, so in general it is taking me 4x as long. This might be partly due to running in a VM (with 3 virtual CPUs) on a Dell quad-core laptop. I'll have to see if there are any BLAS libraries better tailored to this setup. The log files are attached; a summary is shown below.
 
Thanks,
Tom
 
----------
 
via simplewiki-20140410-pages-articles.gensim-prep.debug.log:
saved 64509x27757 matrix, density=0.318% (5690569/1790576313)
...
2014-07-09 13:39:30,758 : INFO : calculating IDF weights for 64509 documents and 27756 features (5690569 matrix non-zeros)
...
3120.84user 92.50system 57:51.81elapsed 92%CPU (0avgtext+0avgdata 282888maxresident)k
376inputs+0outputs (1major+192157minor)pagefaults 0swaps

via simplewiki-20140410-pages-articles.gensim-prep.lsa200.log:
mm = MmCorpus(64509 documents, 27757 features, 5690569 non-zero entries)
...
lsa=LsiModel(num_terms=27757, num_topics=200, decay=1.0, chunksize=20000)
...
596.28user 40.13system 7:55.57elapsed 133%CPU (0avgtext+0avgdata 933848maxresident)k
64inputs+0outputs (1major+1534664minor)pagefaults 0swaps

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1

Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Stepping:              7
CPU MHz:               0.000
BogoMIPS:              1580.03
L1d cache:             32K
L1i cache:             32K
L2 cache:              6144K
NUMA node0 CPU(s):     0-3
simplewiki-20140410-pages-articles.gensim-prep.lsa200.log
simplewiki-20140410-pages-articles.gensim-prep.lsa200.summary.txt
simplewiki-20140410-pages-articles.gensim-prep.debug.log.gz

Tom O'Hara

Jul 16, 2014, 7:31:18 PM
to gen...@googlegroups.com, m...@radimrehurek.com
I just got my desktop out of storage and running again, and tried this via the Enthought Python Distribution (EPD), so it ran directly under Windows (i.e., no VM).
 
The results over simplewiki are now comparable to yours, with just under 10 minutes for preprocessing and less than 2 minutes for LSA with 200 topics. This is on a quad-core Dell XPS 9100 computer with 24GB memory. EPD is using Intel's MKL library for BLAS. For details on the configuration, see the attached file (alt-simplewiki-20140410-pages-articles.gensim-prep.lsa200.summary.txt).
 
Tom
alt-simplewiki-20140410-pages-articles.gensim-prep.lsa200.summary.txt

Radim Řehůřek

Jul 17, 2014, 2:45:10 PM
to gen...@googlegroups.com, m...@radimrehurek.com
Great news Tom!

How about that hangup on full wiki -- did anything change there?

Radim

Tom O'Hara

Jul 18, 2014, 1:43:41 AM
to gen...@googlegroups.com
That's been on hold. With the desktop up again, I'll try it on that shortly.
 
By the way, great job on the word2vec optimizations! I tried that with 4 workers on the quad-core desktop, and it was nearly smoking :) It processed 320K documents in just over 20 minutes.
 
I had struggled a little to get Cython configured properly (e.g., the msvc-to-mingw32 fix), but I didn't need the task manager to tell me when it was finally working right: all four CPUs were pegged during the core calculations.
 
Best,
Tom

Tom O'Hara

Jul 21, 2014, 4:12:56 PM
to gen...@googlegroups.com
The desktop setup processed a 400-topic LSA run in under 7 hours for full Wikipedia. See the attached file for details (new-enwiki-20140402-pages-articles.gensim-prep.lsa400.log).

Tom
new-enwiki-20140402-pages-articles.gensim-prep.lsa400.log

Radim Řehůřek

Aug 7, 2014, 3:18:37 AM
to gen...@googlegroups.com
7 hours sounds reasonable to me. Congratulations :)

Intel's MKL is a good BLAS library. Wikipedia keeps getting larger all the time, so the difference compared to my laptop (it took 5.5h there) may simply be due to you processing more data.

Tom, I take it the "hanging" issue was entirely due to running under a VM then?

Radim

Tom O'Hara

Aug 11, 2014, 6:05:59 PM
to gen...@googlegroups.com
Right, that's good performance given the amount of data. However, sooner or later with Wikipedia, Murphy's Law will outpace Moore's Law!
 
I suspect it is not a VM issue. I'm retrying on a beefier VM, to see how that works.
 
In addition, I can try to make a copy of the matrix that is causing QR decomposition to hang. Shouldn't it just return an error if the matrix can't be decomposed?
 
Tom

Radim Řehůřek

Aug 12, 2014, 2:10:37 AM
to gen...@googlegroups.com
Hi Tom,

thanks for getting back.


On Tuesday, August 12, 2014 1:05:59 AM UTC+3, Tom O'Hara wrote:
Right, that's good performance given the amount of data. However, sooner or later with Wikipedia, Murphy's Law will outpace Moore's Law!
 
I suspect it is not a VM issue. I'm retrying on a beefier VM, to see how that works.

ok, let us know.
 
 
In addition, I can try to make a copy of the matrix that is causing QR decomposition to hang. Shouldn't it just return an error if the matrix can't be decomposed?

the matrix can always be decomposed; this must be a technical error (inside NumPy?), not a "mathematical" one.
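One way to narrow it down: run a standalone QR of a similarly shaped random matrix outside gensim, to see whether the dense QR routine itself hangs on that machine. A minimal sketch (deliberately smaller than the (100000, 200) matrices in the log so it finishes quickly; gensim's own QR may go through scipy's LAPACK wrappers rather than numpy's):

```python
# Standalone dense QR sanity check, of the same kind as gensim's
# "computing QR of (100000, 200) dense matrix" log entries.
import numpy as np

a = np.random.rand(10000, 200)
q, r = np.linalg.qr(a)  # reduced QR: q is (10000, 200), r is (200, 200)
print(q.shape, r.shape)
```

If this hangs too, the problem is in the BLAS/LAPACK build rather than in gensim.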

Best,
Radim