training word2vec on full Wikipedia


h3im...@gmail.com

May 5, 2014, 8:29:23 PM
to gen...@googlegroups.com
Hello,
I'm trying to build a Word2Vec model on the full Wikipedia. This is what I do:


import logging
import os.path
import sys
from gensim.corpora import Dictionary, WikiCorpus
from gensim.models import  Word2Vec

DEFAULT_DICT_SIZE = 100000

if __name__ == '__main__':
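    # `program` was undefined in the pasted script; derive it from the script name.
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)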
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    inp, outp = sys.argv[1:3]
    keep_words = DEFAULT_DICT_SIZE

    wiki = WikiCorpus(inp)
    wiki.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=DEFAULT_DICT_SIZE)
    wiki.dictionary.save_as_text(outp + '_wordids.txt')

    sentences = wiki.get_texts()
    model = Word2Vec(sentences, size=400, window=5, min_count=5, workers=4)
    model.init_sims(replace=True)
    model.save(outp)


It iterates over the whole Wikipedia dataset (about 10 hours), but when it gets to training, the output says it is training on 0 words for 0.5 seconds with 4 workers.

Maybe get_texts() is a generator that has already been exhausted by the dictionary filtering?

Anyway, when I try the saved model (about 400 MB) I get almost random results:

>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('dhuleti', 0.23186782002449036), ('\xc3\xbcbelacker', 0.22952479124069214), ('mukundha', 0.22752583026885986), ('supraportes', 0.2274281531572342), ('poeck', 0.2248934656381607), ('investiment', 0.22386984527111053), ('pyriatyn', 0.22135017812252045), ('kusnezoffii', 0.2210342288017273), ('pmarc', 0.22030536830425262), ('larompong', 0.22000744938850403)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'breakfast'
>>> model.similarity('woman', 'man')
0.035978414072460468

Any suggestions? What am I doing wrong?

Brent Payne

May 6, 2014, 2:05:47 AM
to gen...@googlegroups.com
My first guess is that `sentences` cannot be iterated over twice. Word2Vec needs to run over the sentences twice to train the model: once to gather vocabulary data and a second time to train using skip-grams.
This would explain the random behavior. If I'm right, then replacing your training with the code below should work.


```python
model = Word2Vec(size=400, window=5, min_count=5, workers=4)
sentences = wiki.get_texts()
model.build_vocab(sentences)
sentences = wiki.get_texts()
model.train(sentences)
```
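
The one-pass behaviour is easy to reproduce without gensim. The sketch below is only an illustration: `fake_get_texts` and `RestartableTexts` are hypothetical names, not gensim API, and the two documents are made up. It shows that a generator yields nothing on a second pass, and that wrapping the call in an object with `__iter__` gives a fresh generator on every pass.

```python
def fake_get_texts():
    """Stand-in for wiki.get_texts(): yields one token list per document."""
    yield ["anarchism", "is", "a", "political", "philosophy"]
    yield ["autism", "is", "a", "disorder"]


class RestartableTexts(object):
    """Calls the factory again on every iteration, so each pass starts fresh."""

    def __init__(self, factory):
        self.factory = factory

    def __iter__(self):
        return iter(self.factory())


# A plain generator can be consumed only once.
gen = fake_get_texts()
first_pass = sum(len(doc) for doc in gen)
second_pass = sum(len(doc) for doc in gen)  # generator is already exhausted

# The wrapper can be iterated as many times as needed.
texts = RestartableTexts(fake_get_texts)
pass_a = sum(len(doc) for doc in texts)
pass_b = sum(len(doc) for doc in texts)

print(first_pass, second_pass, pass_a, pass_b)
```

With the real corpus, something like `RestartableTexts(wiki.get_texts)` could in principle be handed to code that needs two passes, although that has not been checked against gensim 0.9 here.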

Hope that helps,
Brent Payne




h3im...@gmail.com

May 6, 2014, 6:41:50 AM
to gen...@googlegroups.com
Thank you for the hint, Brent; I was suspecting the same thing.
I relaunched the script with the modifications you suggested (and also removed the dictionary part).
Now it is running like it did before:

2014-05-06 12:28:39,483 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-06 12:33:27,944 : INFO : adding document #10000 to Dictionary(422302 unique tokens: ['biennials', 'tripolitan', 'oblocutor', 'maderista', 'sowell']...)
2014-05-06 12:37:38,894 : INFO : adding document #20000 to Dictionary(598739 unique tokens: ['biennials', 'tripolitan', 'oblocutor', 'shatzky', 'saire']...)

I have to wait a few hours to see whether everything is OK. Thanks.

Radim Řehůřek

May 7, 2014, 5:05:59 AM
to gen...@googlegroups.com
Brent is right.

Also, you can speed up the process tremendously by not building a dictionary, which is happening automatically here: `wiki = WikiCorpus(inp)`

Word2vec builds its own vocabulary structures, so this word=>id mapping is not necessary at all (unless you want to use your `wiki` corpus for other training, such as LDA, LSI etc).

You can achieve this simply by supplying an empty dictionary in the constructor: `wiki = WikiCorpus(inp, dictionary={})`

HTH,
Radim

h3im...@gmail.com

May 7, 2014, 4:54:47 PM
to gen...@googlegroups.com
Thank you Radim.
I tried it but I still have some problems.
After about a day and a half of computation the process was still going, but I stopped getting output messages at about 45% of words processed.
My machine has 32 GB of RAM. I checked the processes: there were some Python processes with 3 GB allocated and one process with 8 GB allocated, but overall CPU usage was below 2%, while the RAM was almost full (29 GB used, plus about 100% of swap).
So I stopped the process after waiting 4 hours without getting any output message. The average words/s started at about 30k, peaked at 40k and, before the last message, dropped gracefully to 15k (when building with the text8 dataset the average is 80k).
I think the problem is that the RAM is full. What do you think?
I will retry tomorrow with your suggestion of the empty dictionary. It should save RAM, but I don't think it will be sufficient, as the number of words was extremely high even after removing those with count < 5. (I don't remember the number; I closed the terminal to kill the process because it would not stop with Ctrl+C, but it was a huge number of words.)
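
A rough way to sanity-check the RAM theory is to estimate the size of the model's weight matrices. The sketch below assumes two float32 matrices of shape (vocabulary size, vector size); that is an assumption about how the model stores its weights, not something stated in this thread. The 2,000,000-word vocabulary is a made-up figure, since the real count was not recorded, and `estimate_matrix_bytes` is a hypothetical helper. The estimate also ignores the Python dictionary of raw word counts that exists before words with count < 5 are dropped.

```python
def estimate_matrix_bytes(vocab_size, vector_size, n_matrices=2, bytes_per_float=4):
    """Bytes needed for n_matrices dense float matrices of shape (vocab_size, vector_size)."""
    return vocab_size * vector_size * n_matrices * bytes_per_float


# Hypothetical vocabulary size after min_count filtering; size=400 as in the script.
needed = estimate_matrix_bytes(vocab_size=2000000, vector_size=400)
print("%.1f GB" % (needed / 1e9))
```

If the assumption holds, a vocabulary of a few million words at 400 dimensions would account for several gigabytes before any per-word Python objects are counted.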

Radim Řehůřek

May 8, 2014, 5:05:00 AM
to gen...@googlegroups.com, Olivier Grisel
Hello,

30k words per second and 2% CPU utilization don't sound right, with 4 workers. Hard to say what the problem is, try copypasting/gisting the full log.

I remember Olivier (in CC) was running word2vec on Wikipedia too, maybe he can give you some pointers on how much time/ resources it took for them.

Best,
Radim

h3im...@gmail.com

May 8, 2014, 7:57:31 AM
to gen...@googlegroups.com, Olivier Grisel
Hello Radim,
thank you for your answer. I just restarted the process in order to capture the output and show it to you.
Meanwhile, to give you more context, here is some more information:
I'm using gensim 0.9.0
cython 0.20.1
numpy 1.9.0
scipy 0.13.3
I'm also using OpenBLAS:
import numpy.distutils.system_info as sysinfo
sysinfo.get_info('blas')
{'libraries': ['openblas'], 'library_dirs': ['/usr/local/lib'], 'language': 'f77'}

The Wikipedia dump is the latest one, from 9 March.

I'm not using pattern, but I was thinking about it in order to reduce the number of words with lemmatization.

The full updated code of my script is:


import logging
import os.path
import sys

from gensim.corpora import WikiCorpus
from gensim.models import TfidfModel, Word2Vec

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])

    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    if len(sys.argv) < 3:
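        # The script has no module docstring, so __doc__ is None; print a plain usage line instead.
        print("usage: %s <wikipedia-dump.xml.bz2> <output-model>" % program)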
        sys.exit(1)

    inp, outp = sys.argv[1:3]

    wiki = WikiCorpus(inp, dictionary={})
    model = Word2Vec(size=400, window=5, min_count=5, workers=4)
    sentences = wiki.get_texts()
    model.build_vocab(sentences)
    sentences = wiki.get_texts()
    model.train(sentences)
    model.init_sims(replace=True)
    model.save(outp)

Radim Řehůřek

May 8, 2014, 9:38:13 AM
to gen...@googlegroups.com

On Thursday, May 8, 2014 1:57:31 PM UTC+2, h3im...@gmail.com wrote:
Hello Radim,
thank you for your answer. I just re-started the process in order to get the output and show it to you.

OK, cool.

Also, I would suggest storing the extracted/parsed tokens into a separate file. Like it is now, you're extracting and parsing XML from the .bz2 on every pass. Maybe that could be the bottleneck?

If you store the extracted tokens directly, one document per line, whitespace separated utf8 tokens, you will bypass this XML and Wiki format parsing and can simply use the `word2vec.LineSentence` iterator for input:

>>> model = Word2Vec(LineSentence('text_file_with_tokens_per_line.gz'), size=400, workers=4)
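
A minimal sketch of the dump step, assuming Python 3 and using only the standard library. `dump_texts` and `iter_lines` are hypothetical helper names; `iter_lines` merely stands in for the kind of reader described above (one document per line, whitespace-separated tokens), and the sample documents are made up.

```python
import gzip
import os
import tempfile


def dump_texts(texts, path):
    """Write each document as one line of space-separated UTF-8 tokens."""
    n_docs = 0
    with gzip.open(path, "wt", encoding="utf8") as out:
        for tokens in texts:
            # Some corpus readers yield byte strings; decode those to text first.
            words = [t.decode("utf8") if isinstance(t, bytes) else t for t in tokens]
            out.write(" ".join(words) + "\n")
            n_docs += 1
    return n_docs


def iter_lines(path):
    """Yield one token list per line; call again for another pass."""
    with gzip.open(path, "rt", encoding="utf8") as inp:
        for line in inp:
            yield line.split()


# Made-up documents standing in for wiki.get_texts().
docs = [["king", "queen", "man", "woman"], ["breakfast", "lunch", "dinner"]]
path = os.path.join(tempfile.mkdtemp(), "tokens.txt.gz")
written = dump_texts(docs, path)
restored = list(iter_lines(path))
print(written, restored == docs)
```

With the real corpus, `dump_texts(wiki.get_texts(), path)` would need only a single pass over the .bz2 dump; every later pass can then read the flat file instead of re-parsing the XML.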

HTH,
Radim

h3im...@gmail.com

May 8, 2014, 1:36:08 PM
to gen...@googlegroups.com
Hi Radim,
I will try your suggestion as my next step. It actually makes a lot of sense and will give me something more similar to text8.

Right now I have 28 GB of RAM occupied, the first CPU at 100% and the other 7 (quad-core with hyperthreading) below 4%. So yes, I also think there's a bottleneck somewhere, and it is probably the reading/parsing speed of the Wikipedia corpus.
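
One way to put a number on the bottleneck is to compute throughput from the PROGRESS lines. The sketch below is an illustration only: `parse_progress` and `words_per_second` are hypothetical helpers, the regular expression assumes the exact line format seen in this thread, and the two sample lines are copied from the log in this post.

```python
import re
from datetime import datetime

# Two PROGRESS lines copied from the log in this post.
LOG_LINES = [
    "2014-05-08 14:04:58,470 : INFO : PROGRESS: at sentence #10000, processed 28366405 words and 422302 word types",
    "2014-05-08 14:20:33,900 : INFO : PROGRESS: at sentence #100000, processed 154534022 words and 1139211 word types",
]

PATTERN = re.compile(r"^(\S+ \S+) : INFO : PROGRESS: at sentence #\d+, processed (\d+) words")


def parse_progress(line):
    """Return (timestamp, words processed) for a PROGRESS log line."""
    match = PATTERN.match(line)
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return stamp, int(match.group(2))


def words_per_second(first_line, last_line):
    """Average throughput between two PROGRESS lines."""
    t0, w0 = parse_progress(first_line)
    t1, w1 = parse_progress(last_line)
    return (w1 - w0) / (t1 - t0).total_seconds()


rate = words_per_second(LOG_LINES[0], LOG_LINES[1])
print("%.0f words/s" % rate)
```

For these two lines the rate comes out at roughly 135,000 words/s. Running the same calculation over later pairs of lines shows how the rate changes as the scan progresses.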

Anyway, here is the full log so far:
2014-05-08 14:01:38,201 : INFO : running word2vec_builder.py /home/h3imdall/Downloads/enwiki-latest-pages-articles.xml.bz2 word2vec_model
2014-05-08 14:01:38,201 : INFO : collecting all words and their counts
2014-05-08 14:01:38,689 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2014-05-08 14:04:58,470 : INFO : PROGRESS: at sentence #10000, processed 28366405 words and 422302 word types
2014-05-08 14:07:51,292 : INFO : PROGRESS: at sentence #20000, processed 52854726 words and 598739 word types
2014-05-08 14:10:11,407 : INFO : PROGRESS: at sentence #30000, processed 72816166 words and 727655 word types
2014-05-08 14:12:16,155 : INFO : PROGRESS: at sentence #40000, processed 90348238 words and 843377 word types
2014-05-08 14:13:47,682 : INFO : PROGRESS: at sentence #50000, processed 102596334 words and 918337 word types
2014-05-08 14:14:56,986 : INFO : PROGRESS: at sentence #60000, processed 110656914 words and 935412 word types
2014-05-08 14:15:56,171 : INFO : PROGRESS: at sentence #70000, processed 117388273 words and 952055 word types
2014-05-08 14:16:50,977 : INFO : PROGRESS: at sentence #80000, processed 123714095 words and 966320 word types
2014-05-08 14:18:37,575 : INFO : PROGRESS: at sentence #90000, processed 138216665 words and 1046750 word types
2014-05-08 14:20:33,900 : INFO : PROGRESS: at sentence #100000, processed 154534022 words and 1139211 word types
2014-05-08 14:22:18,167 : INFO : PROGRESS: at sentence #110000, processed 169048029 words and 1225306 word types
2014-05-08 14:23:56,730 : INFO : PROGRESS: at sentence #120000, processed 182992704 words and 1302135 word types
2014-05-08 14:25:30,722 : INFO : PROGRESS: at sentence #130000, processed 195971502 words and 1366366 word types
2014-05-08 14:27:09,168 : INFO : PROGRESS: at sentence #140000, processed 209842586 words and 1439332 word types
2014-05-08 14:28:37,259 : INFO : PROGRESS: at sentence #150000, processed 222001833 words and 1519284 word types
2014-05-08 14:30:08,927 : INFO : PROGRESS: at sentence #160000, processed 234721874 words and 1590005 word types
2014-05-08 14:31:32,442 : INFO : PROGRESS: at sentence #170000, processed 246237940 words and 1647499 word types
2014-05-08 14:32:56,718 : INFO : PROGRESS: at sentence #180000, processed 257577341 words and 1696927 word types
2014-05-08 14:34:11,670 : INFO : PROGRESS: at sentence #190000, processed 267706686 words and 1749552 word types
2014-05-08 14:35:26,596 : INFO : PROGRESS: at sentence #200000, processed 277827750 words and 1804156 word types
2014-05-08 14:36:43,694 : INFO : PROGRESS: at sentence #210000, processed 288149313 words and 1852435 word types
2014-05-08 14:37:59,931 : INFO : PROGRESS: at sentence #220000, processed 298503399 words and 1895650 word types
2014-05-08 14:39:12,670 : INFO : PROGRESS: at sentence #230000, processed 308317144 words and 1943038 word types
2014-05-08 14:40:24,383 : INFO : PROGRESS: at sentence #240000, processed 317900616 words and 1990037 word types
2014-05-08 14:41:38,270 : INFO : PROGRESS: at sentence #250000, processed 327680928 words and 2030867 word types
2014-05-08 14:42:49,333 : INFO : PROGRESS: at sentence #260000, processed 336932597 words and 2070078 word types
2014-05-08 14:44:00,214 : INFO : PROGRESS: at sentence #270000, processed 345995838 words and 2108581 word types
2014-05-08 14:45:11,578 : INFO : PROGRESS: at sentence #280000, processed 354989469 words and 2149849 word types
2014-05-08 14:46:19,486 : INFO : PROGRESS: at sentence #290000, processed 363919606 words and 2192911 word types
2014-05-08 14:47:25,748 : INFO : PROGRESS: at sentence #300000, processed 372440166 words and 2232527 word types
2014-05-08 14:48:30,629 : INFO : PROGRESS: at sentence #310000, processed 380966169 words and 2274632 word types
2014-05-08 14:49:36,140 : INFO : PROGRESS: at sentence #320000, processed 389506473 words and 2320301 word types
2014-05-08 14:50:38,604 : INFO : PROGRESS: at sentence #330000, processed 397716820 words and 2353419 word types
2014-05-08 14:51:39,309 : INFO : PROGRESS: at sentence #340000, processed 405719655 words and 2386211 word types
2014-05-08 14:52:44,194 : INFO : PROGRESS: at sentence #350000, processed 413924568 words and 2418008 word types
2014-05-08 14:53:43,762 : INFO : PROGRESS: at sentence #360000, processed 421735351 words and 2448044 word types
2014-05-08 14:54:42,794 : INFO : PROGRESS: at sentence #370000, processed 429365105 words and 2474876 word types
2014-05-08 14:55:43,207 : INFO : PROGRESS: at sentence #380000, processed 437186376 words and 2508080 word types
2014-05-08 14:56:46,063 : INFO : PROGRESS: at sentence #390000, processed 445028339 words and 2538614 word types
2014-05-08 14:57:43,998 : INFO : PROGRESS: at sentence #400000, processed 452617008 words and 2568225 word types
2014-05-08 14:58:42,000 : INFO : PROGRESS: at sentence #410000, processed 460166798 words and 2601297 word types
2014-05-08 14:59:39,901 : INFO : PROGRESS: at sentence #420000, processed 467683270 words and 2632648 word types
2014-05-08 15:00:38,571 : INFO : PROGRESS: at sentence #430000, processed 475242239 words and 2662130 word types
2014-05-08 15:01:37,736 : INFO : PROGRESS: at sentence #440000, processed 482943762 words and 2695650 word types
2014-05-08 15:02:34,749 : INFO : PROGRESS: at sentence #450000, processed 490276424 words and 2723918 word types
2014-05-08 15:03:32,671 : INFO : PROGRESS: at sentence #460000, processed 497801581 words and 2755805 word types
2014-05-08 15:04:29,302 : INFO : PROGRESS: at sentence #470000, processed 504861928 words and 2781886 word types
2014-05-08 15:05:27,750 : INFO : PROGRESS: at sentence #480000, processed 511791727 words and 2811807 word types
2014-05-08 15:06:23,547 : INFO : PROGRESS: at sentence #490000, processed 518767227 words and 2838795 word types
2014-05-08 15:07:19,380 : INFO : PROGRESS: at sentence #500000, processed 525922579 words and 2867098 word types
2014-05-08 15:08:15,955 : INFO : PROGRESS: at sentence #510000, processed 533163199 words and 2893976 word types
2014-05-08 15:09:11,189 : INFO : PROGRESS: at sentence #520000, processed 540146491 words and 2921545 word types
2014-05-08 15:10:06,288 : INFO : PROGRESS: at sentence #530000, processed 547195073 words and 2950378 word types
2014-05-08 15:11:00,959 : INFO : PROGRESS: at sentence #540000, processed 553942514 words and 2979560 word types
2014-05-08 15:11:54,889 : INFO : PROGRESS: at sentence #550000, processed 560766361 words and 3006440 word types
2014-05-08 15:12:48,289 : INFO : PROGRESS: at sentence #560000, processed 567519993 words and 3033185 word types
2014-05-08 15:13:41,122 : INFO : PROGRESS: at sentence #570000, processed 574282113 words and 3060034 word types
2014-05-08 15:14:33,559 : INFO : PROGRESS: at sentence #580000, processed 580860734 words and 3083775 word types
2014-05-08 15:15:26,062 : INFO : PROGRESS: at sentence #590000, processed 587264960 words and 3106893 word types
2014-05-08 15:16:19,131 : INFO : PROGRESS: at sentence #600000, processed 593801228 words and 3134824 word types
2014-05-08 15:17:10,988 : INFO : PROGRESS: at sentence #610000, processed 600088134 words and 3159544 word types
2014-05-08 15:18:01,883 : INFO : PROGRESS: at sentence #620000, processed 606452627 words and 3183376 word types
2014-05-08 15:18:51,743 : INFO : PROGRESS: at sentence #630000, processed 612544701 words and 3206856 word types
2014-05-08 15:19:41,149 : INFO : PROGRESS: at sentence #640000, processed 618597910 words and 3230664 word types
2014-05-08 15:20:31,389 : INFO : PROGRESS: at sentence #650000, processed 624623720 words and 3255563 word types
2014-05-08 15:21:20,337 : INFO : PROGRESS: at sentence #660000, processed 630657298 words and 3278420 word types
2014-05-08 15:22:08,351 : INFO : PROGRESS: at sentence #670000, processed 636574004 words and 3310340 word types
2014-05-08 15:22:56,316 : INFO : PROGRESS: at sentence #680000, processed 642449839 words and 3331809 word types
2014-05-08 15:23:43,270 : INFO : PROGRESS: at sentence #690000, processed 648165441 words and 3353326 word types
2014-05-08 15:24:31,198 : INFO : PROGRESS: at sentence #700000, processed 654110728 words and 3374797 word types
2014-05-08 15:25:20,071 : INFO : PROGRESS: at sentence #710000, processed 660182001 words and 3415554 word types
2014-05-08 15:26:08,234 : INFO : PROGRESS: at sentence #720000, processed 666105792 words and 3436467 word types
2014-05-08 15:26:55,100 : INFO : PROGRESS: at sentence #730000, processed 671694947 words and 3453789 word types
2014-05-08 15:27:45,244 : INFO : PROGRESS: at sentence #740000, processed 677794403 words and 3474194 word types
2014-05-08 15:28:34,606 : INFO : PROGRESS: at sentence #750000, processed 683793987 words and 3497098 word types
2014-05-08 15:29:24,525 : INFO : PROGRESS: at sentence #760000, processed 689685667 words and 3517819 word types
2014-05-08 15:30:12,042 : INFO : PROGRESS: at sentence #770000, processed 695451483 words and 3537703 word types
2014-05-08 15:31:03,263 : INFO : PROGRESS: at sentence #780000, processed 701474234 words and 3562360 word types
2014-05-08 15:31:49,521 : INFO : PROGRESS: at sentence #790000, processed 706963709 words and 3580865 word types
2014-05-08 15:32:36,659 : INFO : PROGRESS: at sentence #800000, processed 712582998 words and 3603765 word types
2014-05-08 15:33:23,482 : INFO : PROGRESS: at sentence #810000, processed 718248495 words and 3626246 word types
2014-05-08 15:34:11,980 : INFO : PROGRESS: at sentence #820000, processed 724007057 words and 3646233 word types
2014-05-08 15:35:00,564 : INFO : PROGRESS: at sentence #830000, processed 729685351 words and 3667230 word types
2014-05-08 15:35:48,011 : INFO : PROGRESS: at sentence #840000, processed 735170107 words and 3689294 word types
2014-05-08 15:36:34,230 : INFO : PROGRESS: at sentence #850000, processed 740673880 words and 3711522 word types
2014-05-08 15:37:21,702 : INFO : PROGRESS: at sentence #860000, processed 746239382 words and 3731735 word types
2014-05-08 15:38:09,223 : INFO : PROGRESS: at sentence #870000, processed 751891687 words and 3753327 word types
2014-05-08 15:38:55,018 : INFO : PROGRESS: at sentence #880000, processed 757347369 words and 3774188 word types
2014-05-08 15:39:40,106 : INFO : PROGRESS: at sentence #890000, processed 762752540 words and 3792423 word types
2014-05-08 15:40:25,402 : INFO : PROGRESS: at sentence #900000, processed 768153208 words and 3812976 word types
2014-05-08 15:41:11,556 : INFO : PROGRESS: at sentence #910000, processed 773571770 words and 3830915 word types
2014-05-08 15:41:57,517 : INFO : PROGRESS: at sentence #920000, processed 779043355 words and 3849092 word types
2014-05-08 15:42:43,656 : INFO : PROGRESS: at sentence #930000, processed 784519252 words and 3868269 word types
2014-05-08 15:43:27,889 : INFO : PROGRESS: at sentence #940000, processed 789698096 words and 3885987 word types
2014-05-08 15:44:15,581 : INFO : PROGRESS: at sentence #950000, processed 795300517 words and 3903502 word types
2014-05-08 15:45:02,000 : INFO : PROGRESS: at sentence #960000, processed 800720213 words and 3922618 word types
2014-05-08 15:45:47,406 : INFO : PROGRESS: at sentence #970000, processed 806072808 words and 3945172 word types
2014-05-08 15:46:32,319 : INFO : PROGRESS: at sentence #980000, processed 811223942 words and 3970422 word types
2014-05-08 15:47:17,579 : INFO : PROGRESS: at sentence #990000, processed 816423731 words and 3995092 word types
2014-05-08 15:48:03,785 : INFO : PROGRESS: at sentence #1000000, processed 821731995 words and 4020465 word types
2014-05-08 15:48:48,892 : INFO : PROGRESS: at sentence #1010000, processed 826911968 words and 4039889 word types
2014-05-08 15:49:32,245 : INFO : PROGRESS: at sentence #1020000, processed 831910678 words and 4062081 word types
2014-05-08 15:50:14,964 : INFO : PROGRESS: at sentence #1030000, processed 836816738 words and 4080743 word types
2014-05-08 15:50:53,979 : INFO : PROGRESS: at sentence #1040000, processed 841178764 words and 4105646 word types
2014-05-08 15:51:40,783 : INFO : PROGRESS: at sentence #1050000, processed 846417594 words and 4129329 word types
2014-05-08 15:52:24,729 : INFO : PROGRESS: at sentence #1060000, processed 851422965 words and 4149042 word types
2014-05-08 15:53:08,637 : INFO : PROGRESS: at sentence #1070000, processed 856383965 words and 4166587 word types
2014-05-08 15:53:52,206 : INFO : PROGRESS: at sentence #1080000, processed 861352284 words and 4184331 word types
2014-05-08 15:54:34,893 : INFO : PROGRESS: at sentence #1090000, processed 866117266 words and 4201077 word types
2014-05-08 15:55:20,052 : INFO : PROGRESS: at sentence #1100000, processed 871328524 words and 4219420 word types
2014-05-08 15:56:02,953 : INFO : PROGRESS: at sentence #1110000, processed 876256929 words and 4235733 word types
2014-05-08 15:56:44,231 : INFO : PROGRESS: at sentence #1120000, processed 880876643 words and 4255730 word types
2014-05-08 15:57:23,903 : INFO : PROGRESS: at sentence #1130000, processed 885327600 words and 4272217 word types
2014-05-08 15:58:04,506 : INFO : PROGRESS: at sentence #1140000, processed 889905828 words and 4289660 word types
2014-05-08 15:58:45,934 : INFO : PROGRESS: at sentence #1150000, processed 894593596 words and 4304708 word types
2014-05-08 15:59:30,971 : INFO : PROGRESS: at sentence #1160000, processed 899814240 words and 4320787 word types
2014-05-08 16:00:13,210 : INFO : PROGRESS: at sentence #1170000, processed 904624896 words and 4337293 word types
2014-05-08 16:00:57,036 : INFO : PROGRESS: at sentence #1180000, processed 909527425 words and 4354898 word types
2014-05-08 16:01:38,417 : INFO : PROGRESS: at sentence #1190000, processed 914044199 words and 4371276 word types
2014-05-08 16:02:20,153 : INFO : PROGRESS: at sentence #1200000, processed 918547123 words and 4389456 word types
2014-05-08 16:03:02,713 : INFO : PROGRESS: at sentence #1210000, processed 923280585 words and 4408178 word types
2014-05-08 16:03:45,215 : INFO : PROGRESS: at sentence #1220000, processed 928021798 words and 4424218 word types
2014-05-08 16:04:26,932 : INFO : PROGRESS: at sentence #1230000, processed 932712684 words and 4440560 word types
2014-05-08 16:05:08,229 : INFO : PROGRESS: at sentence #1240000, processed 937285555 words and 4456021 word types
2014-05-08 16:05:54,891 : INFO : PROGRESS: at sentence #1250000, processed 942162700 words and 4477106 word types
2014-05-08 16:06:37,512 : INFO : PROGRESS: at sentence #1260000, processed 946820029 words and 4494317 word types
2014-05-08 16:07:21,993 : INFO : PROGRESS: at sentence #1270000, processed 951610768 words and 4511380 word types
2014-05-08 16:08:05,274 : INFO : PROGRESS: at sentence #1280000, processed 956352270 words and 4527115 word types
2014-05-08 16:08:47,809 : INFO : PROGRESS: at sentence #1290000, processed 960990098 words and 4541894 word types
2014-05-08 16:09:31,706 : INFO : PROGRESS: at sentence #1300000, processed 965701998 words and 4557089 word types
2014-05-08 16:10:14,924 : INFO : PROGRESS: at sentence #1310000, processed 970342902 words and 4572620 word types
2014-05-08 16:11:00,691 : INFO : PROGRESS: at sentence #1320000, processed 975518670 words and 4590273 word types
2014-05-08 16:11:43,998 : INFO : PROGRESS: at sentence #1330000, processed 980371951 words and 4608548 word types
2014-05-08 16:12:28,049 : INFO : PROGRESS: at sentence #1340000, processed 985208995 words and 4624395 word types
2014-05-08 16:13:09,944 : INFO : PROGRESS: at sentence #1350000, processed 989768531 words and 4647093 word types
2014-05-08 16:13:52,285 : INFO : PROGRESS: at sentence #1360000, processed 994296300 words and 4662453 word types
2014-05-08 16:14:34,203 : INFO : PROGRESS: at sentence #1370000, processed 998864807 words and 4677324 word types
2014-05-08 16:15:16,650 : INFO : PROGRESS: at sentence #1380000, processed 1003477213 words and 4693873 word types
2014-05-08 16:15:57,613 : INFO : PROGRESS: at sentence #1390000, processed 1007918796 words and 4707402 word types
2014-05-08 16:16:41,649 : INFO : PROGRESS: at sentence #1400000, processed 1012682007 words and 4723893 word types
2014-05-08 16:17:24,616 : INFO : PROGRESS: at sentence #1410000, processed 1017286184 words and 4744878 word types
2014-05-08 16:18:07,071 : INFO : PROGRESS: at sentence #1420000, processed 1021831553 words and 4761863 word types
2014-05-08 16:18:50,209 : INFO : PROGRESS: at sentence #1430000, processed 1026436750 words and 4778813 word types
2014-05-08 16:19:32,948 : INFO : PROGRESS: at sentence #1440000, processed 1030845489 words and 4798175 word types
2014-05-08 16:20:14,667 : INFO : PROGRESS: at sentence #1450000, processed 1035319418 words and 4815060 word types
2014-05-08 16:20:57,805 : INFO : PROGRESS: at sentence #1460000, processed 1039958046 words and 4830250 word types
2014-05-08 16:21:41,881 : INFO : PROGRESS: at sentence #1470000, processed 1044598240 words and 4845249 word types
2014-05-08 16:22:32,800 : INFO : PROGRESS: at sentence #1480000, processed 1050083901 words and 4862100 word types
2014-05-08 16:23:16,735 : INFO : PROGRESS: at sentence #1490000, processed 1054816614 words and 4876372 word types
2014-05-08 16:24:00,916 : INFO : PROGRESS: at sentence #1500000, processed 1059532983 words and 4891213 word types
2014-05-08 16:24:44,994 : INFO : PROGRESS: at sentence #1510000, processed 1064224319 words and 4906645 word types
2014-05-08 16:25:26,861 : INFO : PROGRESS: at sentence #1520000, processed 1068551666 words and 4923050 word types
2014-05-08 16:26:04,237 : INFO : PROGRESS: at sentence #1530000, processed 1072392249 words and 4936491 word types
2014-05-08 16:26:46,762 : INFO : PROGRESS: at sentence #1540000, processed 1076722634 words and 4953048 word types
2014-05-08 16:27:31,130 : INFO : PROGRESS: at sentence #1550000, processed 1081317029 words and 4969907 word types
2014-05-08 16:28:13,380 : INFO : PROGRESS: at sentence #1560000, processed 1085731611 words and 4982710 word types
2014-05-08 16:28:56,202 : INFO : PROGRESS: at sentence #1570000, processed 1090124330 words and 4998689 word types
2014-05-08 16:29:37,612 : INFO : PROGRESS: at sentence #1580000, processed 1093908497 words and 5012482 word types
2014-05-08 16:30:20,776 : INFO : PROGRESS: at sentence #1590000, processed 1098219796 words and 5027074 word types
2014-05-08 16:30:54,563 : INFO : PROGRESS: at sentence #1600000, processed 1101347934 words and 5038360 word types
2014-05-08 16:31:33,097 : INFO : PROGRESS: at sentence #1610000, processed 1104896902 words and 5048794 word types
2014-05-08 16:32:15,377 : INFO : PROGRESS: at sentence #1620000, processed 1108986543 words and 5064412 word types
2014-05-08 16:32:56,793 : INFO : PROGRESS: at sentence #1630000, processed 1113059242 words and 5078600 word types
2014-05-08 16:33:38,183 : INFO : PROGRESS: at sentence #1640000, processed 1117057911 words and 5094576 word types
2014-05-08 16:34:24,113 : INFO : PROGRESS: at sentence #1650000, processed 1121404872 words and 5111402 word types
2014-05-08 16:35:07,176 : INFO : PROGRESS: at sentence #1660000, processed 1125599555 words and 5125779 word types
2014-05-08 16:35:53,085 : INFO : PROGRESS: at sentence #1670000, processed 1129684098 words and 5140430 word types
2014-05-08 16:36:36,364 : INFO : PROGRESS: at sentence #1680000, processed 1133841563 words and 5154982 word types
2014-05-08 16:37:19,873 : INFO : PROGRESS: at sentence #1690000, processed 1138132828 words and 5168258 word types
2014-05-08 16:38:04,854 : INFO : PROGRESS: at sentence #1700000, processed 1142474167 words and 5183552 word types
2014-05-08 16:38:46,126 : INFO : PROGRESS: at sentence #1710000, processed 1146395904 words and 5197894 word types
2014-05-08 16:39:28,940 : INFO : PROGRESS: at sentence #1720000, processed 1150603278 words and 5225223 word types
2014-05-08 16:40:13,260 : INFO : PROGRESS: at sentence #1730000, processed 1154723920 words and 5237996 word types
2014-05-08 16:40:58,749 : INFO : PROGRESS: at sentence #1740000, processed 1158973646 words and 5249842 word types
2014-05-08 16:41:47,148 : INFO : PROGRESS: at sentence #1750000, processed 1163358879 words and 5264109 word types
2014-05-08 16:42:31,333 : INFO : PROGRESS: at sentence #1760000, processed 1167646172 words and 5277999 word types
2014-05-08 16:43:13,952 : INFO : PROGRESS: at sentence #1770000, processed 1171794989 words and 5292017 word types
2014-05-08 16:44:06,419 : INFO : PROGRESS: at sentence #1780000, processed 1175776155 words and 5307512 word types
2014-05-08 16:44:53,895 : INFO : PROGRESS: at sentence #1790000, processed 1179672039 words and 5327952 word types
2014-05-08 16:45:42,084 : INFO : PROGRESS: at sentence #1800000, processed 1183803710 words and 5351906 word types
2014-05-08 16:46:27,826 : INFO : PROGRESS: at sentence #1810000, processed 1187982377 words and 5369997 word types
2014-05-08 16:47:12,458 : INFO : PROGRESS: at sentence #1820000, processed 1192079633 words and 5383893 word types
2014-05-08 16:47:58,359 : INFO : PROGRESS: at sentence #1830000, processed 1196046229 words and 5398143 word types
2014-05-08 16:48:44,180 : INFO : PROGRESS: at sentence #1840000, processed 1200096927 words and 5410552 word types
2014-05-08 16:49:32,435 : INFO : PROGRESS: at sentence #1850000, processed 1204186815 words and 5422895 word types
2014-05-08 16:50:18,304 : INFO : PROGRESS: at sentence #1860000, processed 1208337156 words and 5435650 word types
2014-05-08 16:51:02,868 : INFO : PROGRESS: at sentence #1870000, processed 1212691863 words and 5450716 word types
2014-05-08 16:51:51,457 : INFO : PROGRESS: at sentence #1880000, processed 1216909421 words and 5463134 word types
2014-05-08 16:52:52,552 : INFO : PROGRESS: at sentence #1890000, processed 1220892300 words and 5476279 word types
2014-05-08 16:53:39,940 : INFO : PROGRESS: at sentence #1900000, processed 1225121794 words and 5490160 word types
2014-05-08 16:54:23,944 : INFO : PROGRESS: at sentence #1910000, processed 1229148880 words and 5504435 word types
2014-05-08 16:55:10,289 : INFO : PROGRESS: at sentence #1920000, processed 1233409688 words and 5518467 word types
2014-05-08 16:55:52,963 : INFO : PROGRESS: at sentence #1930000, processed 1237548671 words and 5530492 word types
2014-05-08 16:56:37,513 : INFO : PROGRESS: at sentence #1940000, processed 1241688234 words and 5543457 word types
2014-05-08 16:57:22,273 : INFO : PROGRESS: at sentence #1950000, processed 1245840128 words and 5555973 word types
2014-05-08 16:58:05,199 : INFO : PROGRESS: at sentence #1960000, processed 1249954713 words and 5569097 word types
2014-05-08 16:58:48,097 : INFO : PROGRESS: at sentence #1970000, processed 1254115397 words and 5580912 word types
2014-05-08 16:59:32,155 : INFO : PROGRESS: at sentence #1980000, processed 1258250835 words and 5592728 word types
2014-05-08 17:00:14,906 : INFO : PROGRESS: at sentence #1990000, processed 1262168839 words and 5604694 word types
2014-05-08 17:01:03,249 : INFO : PROGRESS: at sentence #2000000, processed 1266499458 words and 5618266 word types
2014-05-08 17:01:49,805 : INFO : PROGRESS: at sentence #2010000, processed 1270599364 words and 5632202 word types
2014-05-08 17:02:33,224 : INFO : PROGRESS: at sentence #2020000, processed 1274695378 words and 5647157 word types
2014-05-08 17:03:20,091 : INFO : PROGRESS: at sentence #2030000, processed 1278784551 words and 5661314 word types
2014-05-08 17:04:01,939 : INFO : PROGRESS: at sentence #2040000, processed 1282554336 words and 5672125 word types
2014-05-08 17:04:53,910 : INFO : PROGRESS: at sentence #2050000, processed 1287469733 words and 5687392 word types
2014-05-08 17:05:49,228 : INFO : PROGRESS: at sentence #2060000, processed 1292089005 words and 5700095 word types
2014-05-08 17:06:35,945 : INFO : PROGRESS: at sentence #2070000, processed 1296558847 words and 5713773 word types
2014-05-08 17:07:22,644 : INFO : PROGRESS: at sentence #2080000, processed 1301248832 words and 5730143 word types
2014-05-08 17:08:08,804 : INFO : PROGRESS: at sentence #2090000, processed 1305448518 words and 5745184 word types
2014-05-08 17:08:58,885 : INFO : PROGRESS: at sentence #2100000, processed 1309786937 words and 5760088 word types
2014-05-08 17:09:50,113 : INFO : PROGRESS: at sentence #2110000, processed 1314043628 words and 5774027 word types
2014-05-08 17:10:35,096 : INFO : PROGRESS: at sentence #2120000, processed 1318046742 words and 5786586 word types
2014-05-08 17:11:19,637 : INFO : PROGRESS: at sentence #2130000, processed 1322266462 words and 5799119 word types
2014-05-08 17:12:03,815 : INFO : PROGRESS: at sentence #2140000, processed 1326393613 words and 5812066 word types
2014-05-08 17:12:47,857 : INFO : PROGRESS: at sentence #2150000, processed 1330634173 words and 5825518 word types
2014-05-08 17:13:35,840 : INFO : PROGRESS: at sentence #2160000, processed 1335039574 words and 5839286 word types
2014-05-08 17:14:20,861 : INFO : PROGRESS: at sentence #2170000, processed 1339244732 words and 5853802 word types
2014-05-08 17:15:06,622 : INFO : PROGRESS: at sentence #2180000, processed 1343554575 words and 5866960 word types
2014-05-08 17:15:56,168 : INFO : PROGRESS: at sentence #2190000, processed 1348057155 words and 5879813 word types
2014-05-08 17:16:42,898 : INFO : PROGRESS: at sentence #2200000, processed 1352182531 words and 5892755 word types
2014-05-08 17:17:28,460 : INFO : PROGRESS: at sentence #2210000, processed 1356561376 words and 5908573 word types
2014-05-08 17:18:11,463 : INFO : PROGRESS: at sentence #2220000, processed 1360577854 words and 5922195 word types
2014-05-08 17:18:53,650 : INFO : PROGRESS: at sentence #2230000, processed 1364446293 words and 5935676 word types
2014-05-08 17:19:32,265 : INFO : PROGRESS: at sentence #2240000, processed 1367906286 words and 5950138 word types
2014-05-08 17:20:08,577 : INFO : PROGRESS: at sentence #2250000, processed 1371001176 words and 5957639 word types
2014-05-08 17:20:47,387 : INFO : PROGRESS: at sentence #2260000, processed 1374596494 words and 5969486 word types
2014-05-08 17:21:31,271 : INFO : PROGRESS: at sentence #2270000, processed 1378544503 words and 5987549 word types
2014-05-08 17:22:18,832 : INFO : PROGRESS: at sentence #2280000, processed 1382775687 words and 6005861 word types
2014-05-08 17:23:04,811 : INFO : PROGRESS: at sentence #2290000, processed 1386861515 words and 6019202 word types
2014-05-08 17:23:50,895 : INFO : PROGRESS: at sentence #2300000, processed 1390963700 words and 6032742 word types
2014-05-08 17:24:37,379 : INFO : PROGRESS: at sentence #2310000, processed 1395009960 words and 6046765 word types
2014-05-08 17:25:20,246 : INFO : PROGRESS: at sentence #2320000, processed 1399068422 words and 6058059 word types
2014-05-08 17:26:05,034 : INFO : PROGRESS: at sentence #2330000, processed 1403136635 words and 6069747 word types
2014-05-08 17:26:53,134 : INFO : PROGRESS: at sentence #2340000, processed 1407544681 words and 6083353 word types
2014-05-08 17:27:43,470 : INFO : PROGRESS: at sentence #2350000, processed 1411841916 words and 6097388 word types
2014-05-08 17:28:37,537 : INFO : PROGRESS: at sentence #2360000, processed 1416295328 words and 6113927 word types
2014-05-08 17:29:24,715 : INFO : PROGRESS: at sentence #2370000, processed 1420378577 words and 6126279 word types
2014-05-08 17:30:10,170 : INFO : PROGRESS: at sentence #2380000, processed 1424377004 words and 6146725 word types
2014-05-08 17:30:55,991 : INFO : PROGRESS: at sentence #2390000, processed 1428528290 words and 6161301 word types
2014-05-08 17:31:42,675 : INFO : PROGRESS: at sentence #2400000, processed 1432640164 words and 6175428 word types
2014-05-08 17:32:25,734 : INFO : PROGRESS: at sentence #2410000, processed 1436462991 words and 6193902 word types
2014-05-08 17:33:06,160 : INFO : PROGRESS: at sentence #2420000, processed 1440058898 words and 6207214 word types
2014-05-08 17:33:53,330 : INFO : PROGRESS: at sentence #2430000, processed 1444562136 words and 6220811 word types
2014-05-08 17:34:41,344 : INFO : PROGRESS: at sentence #2440000, processed 1448995031 words and 6234887 word types
2014-05-08 17:35:27,040 : INFO : PROGRESS: at sentence #2450000, processed 1453184919 words and 6252331 word types
2014-05-08 17:36:10,403 : INFO : PROGRESS: at sentence #2460000, processed 1457142570 words and 6270480 word types
2014-05-08 17:36:56,948 : INFO : PROGRESS: at sentence #2470000, processed 1461363294 words and 6284438 word types
2014-05-08 17:37:42,195 : INFO : PROGRESS: at sentence #2480000, processed 1465545523 words and 6299090 word types
2014-05-08 17:38:27,801 : INFO : PROGRESS: at sentence #2490000, processed 1469741989 words and 6311655 word types
2014-05-08 17:39:17,903 : INFO : PROGRESS: at sentence #2500000, processed 1474419114 words and 6323609 word types
2014-05-08 17:40:04,373 : INFO : PROGRESS: at sentence #2510000, processed 1478425487 words and 6335535 word types
2014-05-08 17:40:56,039 : INFO : PROGRESS: at sentence #2520000, processed 1482715932 words and 6349744 word types
2014-05-08 17:41:44,053 : INFO : PROGRESS: at sentence #2530000, processed 1487033565 words and 6363216 word types
2014-05-08 17:42:26,962 : INFO : PROGRESS: at sentence #2540000, processed 1490993965 words and 6378335 word types
2014-05-08 17:43:14,216 : INFO : PROGRESS: at sentence #2550000, processed 1495133134 words and 6392626 word types
2014-05-08 17:44:00,815 : INFO : PROGRESS: at sentence #2560000, processed 1499378723 words and 6408626 word types
2014-05-08 17:44:46,441 : INFO : PROGRESS: at sentence #2570000, processed 1503603835 words and 6427849 word types
2014-05-08 17:45:30,836 : INFO : PROGRESS: at sentence #2580000, processed 1507647549 words and 6440544 word types
2014-05-08 17:46:15,344 : INFO : PROGRESS: at sentence #2590000, processed 1511732186 words and 6452757 word types
2014-05-08 17:46:59,225 : INFO : PROGRESS: at sentence #2600000, processed 1515530822 words and 6464931 word types
2014-05-08 17:47:46,370 : INFO : PROGRESS: at sentence #2610000, processed 1519624921 words and 6478270 word types
2014-05-08 17:48:34,693 : INFO : PROGRESS: at sentence #2620000, processed 1523557015 words and 6490309 word types
2014-05-08 17:49:25,464 : INFO : PROGRESS: at sentence #2630000, processed 1527840089 words and 6502054 word types
2014-05-08 17:50:16,197 : INFO : PROGRESS: at sentence #2640000, processed 1532183124 words and 6519022 word types
2014-05-08 17:51:05,339 : INFO : PROGRESS: at sentence #2650000, processed 1536156208 words and 6534050 word types
2014-05-08 17:51:52,546 : INFO : PROGRESS: at sentence #2660000, processed 1540279975 words and 6547062 word types
2014-05-08 17:52:40,300 : INFO : PROGRESS: at sentence #2670000, processed 1544307955 words and 6561701 word types
2014-05-08 17:53:26,174 : INFO : PROGRESS: at sentence #2680000, processed 1548342807 words and 6575825 word types
2014-05-08 17:54:11,429 : INFO : PROGRESS: at sentence #2690000, processed 1552331759 words and 6588225 word types
2014-05-08 17:54:56,376 : INFO : PROGRESS: at sentence #2700000, processed 1556306707 words and 6601165 word types
2014-05-08 17:55:53,114 : INFO : PROGRESS: at sentence #2710000, processed 1560213481 words and 6616359 word types
2014-05-08 17:56:44,781 : INFO : PROGRESS: at sentence #2720000, processed 1564371253 words and 6632978 word types
2014-05-08 17:57:34,807 : INFO : PROGRESS: at sentence #2730000, processed 1568589855 words and 6662246 word types
2014-05-08 17:58:21,173 : INFO : PROGRESS: at sentence #2740000, processed 1572612302 words and 6674793 word types
2014-05-08 17:59:05,682 : INFO : PROGRESS: at sentence #2750000, processed 1576528451 words and 6687326 word types
2014-05-08 17:59:47,660 : INFO : PROGRESS: at sentence #2760000, processed 1580134181 words and 6699255 word types
2014-05-08 18:00:25,079 : INFO : PROGRESS: at sentence #2770000, processed 1583337195 words and 6709213 word types
2014-05-08 18:01:03,806 : INFO : PROGRESS: at sentence #2780000, processed 1586577700 words and 6719987 word types
2014-05-08 18:01:52,422 : INFO : PROGRESS: at sentence #2790000, processed 1590328668 words and 6731187 word types
2014-05-08 18:02:46,981 : INFO : PROGRESS: at sentence #2800000, processed 1594517368 words and 6745881 word types
2014-05-08 18:03:34,480 : INFO : PROGRESS: at sentence #2810000, processed 1598738677 words and 6761457 word types
2014-05-08 18:04:23,025 : INFO : PROGRESS: at sentence #2820000, processed 1602907924 words and 6774678 word types
2014-05-08 18:05:13,048 : INFO : PROGRESS: at sentence #2830000, processed 1607118347 words and 6787928 word types
2014-05-08 18:05:59,027 : INFO : PROGRESS: at sentence #2840000, processed 1611238886 words and 6801111 word types
2014-05-08 18:06:54,311 : INFO : PROGRESS: at sentence #2850000, processed 1617339858 words and 6818517 word types
2014-05-08 18:07:41,432 : INFO : PROGRESS: at sentence #2860000, processed 1621533255 words and 6832004 word types
2014-05-08 18:08:29,992 : INFO : PROGRESS: at sentence #2870000, processed 1625790957 words and 6846192 word types
2014-05-08 18:09:16,731 : INFO : PROGRESS: at sentence #2880000, processed 1630039482 words and 6862486 word types
2014-05-08 18:10:06,583 : INFO : PROGRESS: at sentence #2890000, processed 1634330730 words and 6876945 word types
2014-05-08 18:10:53,591 : INFO : PROGRESS: at sentence #2900000, processed 1638634352 words and 6890728 word types
2014-05-08 18:11:42,609 : INFO : PROGRESS: at sentence #2910000, processed 1642872590 words and 6903650 word types
2014-05-08 18:12:32,568 : INFO : PROGRESS: at sentence #2920000, processed 1647101725 words and 6916862 word types
2014-05-08 18:13:24,108 : INFO : PROGRESS: at sentence #2930000, processed 1651449625 words and 6931003 word types
2014-05-08 18:14:13,494 : INFO : PROGRESS: at sentence #2940000, processed 1655589739 words and 6947864 word types
2014-05-08 18:15:01,739 : INFO : PROGRESS: at sentence #2950000, processed 1659564330 words and 6962264 word types
2014-05-08 18:15:50,950 : INFO : PROGRESS: at sentence #2960000, processed 1663597908 words and 6978087 word types
2014-05-08 18:16:35,891 : INFO : PROGRESS: at sentence #2970000, processed 1667261706 words and 6993888 word types
2014-05-08 18:17:24,734 : INFO : PROGRESS: at sentence #2980000, processed 1671219826 words and 7007901 word types
2014-05-08 18:18:11,533 : INFO : PROGRESS: at sentence #2990000, processed 1674868237 words and 7020662 word types
2014-05-08 18:19:00,456 : INFO : PROGRESS: at sentence #3000000, processed 1678909464 words and 7034184 word types
2014-05-08 18:19:51,055 : INFO : PROGRESS: at sentence #3010000, processed 1683076241 words and 7047851 word types
2014-05-08 18:20:41,347 : INFO : PROGRESS: at sentence #3020000, processed 1687182649 words and 7061090 word types
2014-05-08 18:21:30,249 : INFO : PROGRESS: at sentence #3030000, processed 1691312131 words and 7074500 word types
2014-05-08 18:22:17,190 : INFO : PROGRESS: at sentence #3040000, processed 1695428242 words and 7088578 word types
2014-05-08 18:23:05,323 : INFO : PROGRESS: at sentence #3050000, processed 1699509448 words and 7101934 word types
2014-05-08 18:23:54,381 : INFO : PROGRESS: at sentence #3060000, processed 1703690507 words and 7114043 word types
2014-05-08 18:24:44,577 : INFO : PROGRESS: at sentence #3070000, processed 1707900768 words and 7126343 word types
2014-05-08 18:25:31,773 : INFO : PROGRESS: at sentence #3080000, processed 1711806518 words and 7138279 word types
2014-05-08 18:26:22,708 : INFO : PROGRESS: at sentence #3090000, processed 1715799140 words and 7151545 word types
2014-05-08 18:27:14,846 : INFO : PROGRESS: at sentence #3100000, processed 1719960157 words and 7163483 word types
2014-05-08 18:28:03,246 : INFO : PROGRESS: at sentence #3110000, processed 1723999511 words and 7176505 word types
2014-05-08 18:28:52,448 : INFO : PROGRESS: at sentence #3120000, processed 1728073382 words and 7188802 word types
2014-05-08 18:29:44,673 : INFO : PROGRESS: at sentence #3130000, processed 1732217153 words and 7204513 word types
2014-05-08 18:30:34,325 : INFO : PROGRESS: at sentence #3140000, processed 1736255445 words and 7216542 word types
2014-05-08 18:31:24,540 : INFO : PROGRESS: at sentence #3150000, processed 1740348156 words and 7229643 word types
2014-05-08 18:32:18,557 : INFO : PROGRESS: at sentence #3160000, processed 1744379816 words and 7242297 word types
2014-05-08 18:33:12,071 : INFO : PROGRESS: at sentence #3170000, processed 1748437785 words and 7253069 word types
2014-05-08 18:34:04,418 : INFO : PROGRESS: at sentence #3180000, processed 1752363922 words and 7267529 word types
2014-05-08 18:34:56,499 : INFO : PROGRESS: at sentence #3190000, processed 1756010116 words and 7279179 word types
2014-05-08 18:35:46,798 : INFO : PROGRESS: at sentence #3200000, processed 1759820487 words and 7292569 word types
2014-05-08 18:36:37,638 : INFO : PROGRESS: at sentence #3210000, processed 1763650896 words and 7304769 word types
2014-05-08 18:37:28,454 : INFO : PROGRESS: at sentence #3220000, processed 1767446791 words and 7322081 word types
2014-05-08 18:38:17,824 : INFO : PROGRESS: at sentence #3230000, processed 1771164864 words and 7335587 word types
2014-05-08 18:39:10,650 : INFO : PROGRESS: at sentence #3240000, processed 1775195034 words and 7349231 word types
2014-05-08 18:40:02,881 : INFO : PROGRESS: at sentence #3250000, processed 1778847648 words and 7362064 word types
2014-05-08 18:40:55,636 : INFO : PROGRESS: at sentence #3260000, processed 1782984853 words and 7375654 word types
2014-05-08 18:41:46,255 : INFO : PROGRESS: at sentence #3270000, processed 1787143624 words and 7390580 word types
2014-05-08 18:42:38,834 : INFO : PROGRESS: at sentence #3280000, processed 1791189183 words and 7403056 word types
2014-05-08 18:43:31,958 : INFO : PROGRESS: at sentence #3290000, processed 1795167362 words and 7415872 word types
2014-05-08 18:44:23,636 : INFO : PROGRESS: at sentence #3300000, processed 1799306040 words and 7429570 word types
2014-05-08 18:45:12,313 : INFO : PROGRESS: at sentence #3310000, processed 1803141405 words and 7441021 word types
2014-05-08 18:46:02,579 : INFO : PROGRESS: at sentence #3320000, processed 1807144921 words and 7454225 word types
2014-05-08 18:46:54,052 : INFO : PROGRESS: at sentence #3330000, processed 1811284571 words and 7467687 word types
2014-05-08 18:47:44,618 : INFO : PROGRESS: at sentence #3340000, processed 1815256795 words and 7478635 word types
2014-05-08 18:48:32,743 : INFO : PROGRESS: at sentence #3350000, processed 1819043375 words and 7489969 word types
2014-05-08 18:49:20,391 : INFO : PROGRESS: at sentence #3360000, processed 1822918770 words and 7501645 word types
2014-05-08 18:50:10,088 : INFO : PROGRESS: at sentence #3370000, processed 1826872324 words and 7513653 word types
2014-05-08 18:50:57,865 : INFO : PROGRESS: at sentence #3380000, processed 1830669310 words and 7525852 word types
2014-05-08 18:51:46,750 : INFO : PROGRESS: at sentence #3390000, processed 1834413795 words and 7536158 word types
2014-05-08 18:52:36,455 : INFO : PROGRESS: at sentence #3400000, processed 1838100661 words and 7547212 word types
2014-05-08 18:53:24,612 : INFO : PROGRESS: at sentence #3410000, processed 1841852852 words and 7558185 word types
2014-05-08 18:54:14,913 : INFO : PROGRESS: at sentence #3420000, processed 1845344999 words and 7575423 word types
2014-05-08 18:55:05,099 : INFO : PROGRESS: at sentence #3430000, processed 1848992041 words and 7590966 word types
2014-05-08 18:55:54,947 : INFO : PROGRESS: at sentence #3440000, processed 1852769069 words and 7607335 word types
2014-05-08 18:56:41,119 : INFO : PROGRESS: at sentence #3450000, processed 1856285033 words and 7621937 word types
2014-05-08 18:57:29,130 : INFO : PROGRESS: at sentence #3460000, processed 1859800113 words and 7633060 word types
2014-05-08 18:58:17,493 : INFO : PROGRESS: at sentence #3470000, processed 1863379610 words and 7644498 word types
2014-05-08 18:59:02,506 : INFO : PROGRESS: at sentence #3480000, processed 1866962936 words and 7654771 word types
2014-05-08 18:59:49,067 : INFO : PROGRESS: at sentence #3490000, processed 1870480975 words and 7666905 word types
2014-05-08 19:00:34,780 : INFO : PROGRESS: at sentence #3500000, processed 1873906154 words and 7677671 word types
2014-05-08 19:01:20,515 : INFO : PROGRESS: at sentence #3510000, processed 1877241011 words and 7688280 word types
2014-05-08 19:02:05,649 : INFO : PROGRESS: at sentence #3520000, processed 1880406083 words and 7698496 word types
2014-05-08 19:02:46,843 : INFO : finished iterating over Wikipedia corpus of 3529460 documents with 1883635185 positions (total 14313024 articles, 1937483820 positions before pruning articles shorter than 50 words)
2014-05-08 19:02:46,843 : INFO : collected 7710827 word types from a corpus of 1883635185 words and 3529460 sentences
2014-05-08 19:02:50,215 : INFO : total 1878619 word types after removing those with count<5
2014-05-08 19:02:50,215 : INFO : constructing a huffman tree from 1878619 words
2014-05-08 19:04:10,334 : INFO : built huffman tree with maximum node depth 28
2014-05-08 19:04:10,943 : INFO : resetting layer weights
2014-05-08 19:04:34,879 : INFO : training model with 4 workers on 1878619 vocabulary and 400 features
2014-05-08 19:04:58,513 : INFO : PROGRESS: at 0.00% words, alpha 0.02500, 3690 words/s
2014-05-08 19:05:01,884 : INFO : PROGRESS: at 0.01% words, alpha 0.02500, 6350 words/s
2014-05-08 19:05:05,032 : INFO : PROGRESS: at 0.01% words, alpha 0.02500, 8466 words/s
2014-05-08 19:05:08,998 : INFO : PROGRESS: at 0.02% words, alpha 0.02500, 10077 words/s
2014-05-08 19:05:11,503 : INFO : PROGRESS: at 0.02% words, alpha 0.02500, 11524 words/s
2014-05-08 19:05:13,775 : INFO : PROGRESS: at 0.03% words, alpha 0.02500, 12758 words/s
2014-05-08 19:05:17,389 : INFO : PROGRESS: at 0.03% words, alpha 0.02500, 13453 words/s
2014-05-08 19:05:22,190 : INFO : PROGRESS: at 0.03% words, alpha 0.02500, 13801 words/s
2014-05-08 19:05:26,233 : INFO : PROGRESS: at 0.04% words, alpha 0.02499, 14453 words/s
2014-05-08 19:05:28,103 : INFO : PROGRESS: at 0.04% words, alpha 0.02499, 15575 words/s
2014-05-08 19:05:29,436 : INFO : PROGRESS: at 0.05% words, alpha 0.02499, 16520 words/s
2014-05-08 19:05:34,219 : INFO : PROGRESS: at 0.05% words, alpha 0.02499, 16450 words/s
2014-05-08 19:05:39,014 : INFO : PROGRESS: at 0.06% words, alpha 0.02499, 16449 words/s
2014-05-08 19:05:40,252 : INFO : PROGRESS: at 0.06% words, alpha 0.02499, 17294 words/s
2014-05-08 19:05:42,388 : INFO : PROGRESS: at 0.06% words, alpha 0.02499, 17946 words/s
2014-05-08 19:05:46,268 : INFO : PROGRESS: at 0.07% words, alpha 0.02499, 17990 words/s
2014-05-08 19:05:52,446 : INFO : PROGRESS: at 0.07% words, alpha 0.02499, 17637 words/s
2014-05-08 19:05:54,310 : INFO : PROGRESS: at 0.08% words, alpha 0.02498, 18289 words/s
2014-05-08 19:05:56,607 : INFO : PROGRESS: at 0.08% words, alpha 0.02498, 18885 words/s
2014-05-08 19:06:00,189 : INFO : PROGRESS: at 0.09% words, alpha 0.02498, 19096 words/s
2014-05-08 19:06:06,677 : INFO : PROGRESS: at 0.09% words, alpha 0.02498, 18711 words/s
2014-05-08 19:06:10,877 : INFO : PROGRESS: at 0.10% words, alpha 0.02498, 19631 words/s
2014-05-08 19:06:14,093 : INFO : PROGRESS: at 0.11% words, alpha 0.02498, 19854 words/s
2014-05-08 19:06:19,752 : INFO : PROGRESS: at 0.11% words, alpha 0.02498, 19564 words/s
2014-05-08 19:06:21,183 : INFO : PROGRESS: at 0.11% words, alpha 0.02498, 20123 words/s
2014-05-08 19:06:24,044 : INFO : PROGRESS: at 0.12% words, alpha 0.02497, 20366 words/s
2014-05-08 19:06:26,299 : INFO : PROGRESS: at 0.12% words, alpha 0.02497, 20685 words/s
2014-05-08 19:06:31,957 : INFO : PROGRESS: at 0.13% words, alpha 0.02497, 20372 words/s
2014-05-08 19:06:33,702 : INFO : PROGRESS: at 0.13% words, alpha 0.02497, 20764 words/s
2014-05-08 19:06:36,896 : INFO : PROGRESS: at 0.14% words, alpha 0.02497, 20917 words/s
2014-05-08 19:06:40,123 : INFO : PROGRESS: at 0.14% words, alpha 0.02497, 21055 words/s
2014-05-08 19:06:44,763 : INFO : PROGRESS: at 0.14% words, alpha 0.02497, 20921 words/s
2014-05-08 19:06:46,865 : INFO : PROGRESS: at 0.15% words, alpha 0.02497, 21224 words/s
2014-05-08 19:06:49,536 : INFO : PROGRESS: at 0.15% words, alpha 0.02497, 21397 words/s
2014-05-08 19:06:52,677 : INFO : PROGRESS: at 0.16% words, alpha 0.02496, 21483 words/s
2014-05-08 19:06:57,981 : INFO : PROGRESS: at 0.16% words, alpha 0.02496, 21266 words/s
2014-05-08 19:07:00,388 : INFO : PROGRESS: at 0.17% words, alpha 0.02496, 21499 words/s
2014-05-08 19:07:03,644 : INFO : PROGRESS: at 0.17% words, alpha 0.02496, 21631 words/s
2014-05-08 19:07:06,257 : INFO : PROGRESS: at 0.18% words, alpha 0.02496, 21828 words/s
2014-05-08 19:07:11,656 : INFO : PROGRESS: at 0.18% words, alpha 0.02496, 21620 words/s
2014-05-08 19:07:13,486 : INFO : PROGRESS: at 0.19% words, alpha 0.02496, 21890 words/s
2014-05-08 19:07:17,281 : INFO : PROGRESS: at 0.19% words, alpha 0.02496, 21907 words/s
2014-05-08 19:07:18,595 : INFO : PROGRESS: at 0.19% words, alpha 0.02496, 22220 words/s
2014-05-08 19:07:24,907 : INFO : PROGRESS: at 0.20% words, alpha 0.02495, 21920 words/s
2014-05-08 19:07:26,281 : INFO : PROGRESS: at 0.20% words, alpha 0.02495, 22239 words/s
2014-05-08 19:07:29,684 : INFO : PROGRESS: at 0.21% words, alpha 0.02495, 22282 words/s
2014-05-08 19:07:31,355 : INFO : PROGRESS: at 0.21% words, alpha 0.02495, 22549 words/s
2014-05-08 19:07:37,546 : INFO : PROGRESS: at 0.22% words, alpha 0.02495, 22173 words/s
2014-05-08 19:07:41,457 : INFO : PROGRESS: at 0.22% words, alpha 0.02495, 22557 words/s
2014-05-08 19:07:43,940 : INFO : PROGRESS: at 0.23% words, alpha 0.02495, 22691 words/s
2014-05-08 19:07:51,326 : INFO : PROGRESS: at 0.23% words, alpha 0.02495, 22278 words/s
2014-05-08 19:07:54,300 : INFO : PROGRESS: at 0.24% words, alpha 0.02494, 22799 words/s
2014-05-08 19:07:57,709 : INFO : PROGRESS: at 0.25% words, alpha 0.02494, 22831 words/s
2014-05-08 19:08:04,524 : INFO : PROGRESS: at 0.25% words, alpha 0.02494, 22464 words/s
2014-05-08 19:08:07,585 : INFO : PROGRESS: at 0.26% words, alpha 0.02494, 22948 words/s
2014-05-08 19:08:10,104 : INFO : PROGRESS: at 0.26% words, alpha 0.02494, 23060 words/s
2014-05-08 19:08:15,876 : INFO : PROGRESS: at 0.27% words, alpha 0.02494, 22819 words/s
2014-05-08 19:08:17,529 : INFO : PROGRESS: at 0.27% words, alpha 0.02494, 23028 words/s
2014-05-08 19:08:19,614 : INFO : PROGRESS: at 0.28% words, alpha 0.02493, 23185 words/s
2014-05-08 19:08:23,130 : INFO : PROGRESS: at 0.28% words, alpha 0.02493, 23204 words/s
2014-05-08 19:08:28,262 : INFO : PROGRESS: at 0.29% words, alpha 0.02493, 23062 words/s
2014-05-08 19:08:33,856 : INFO : PROGRESS: at 0.29% words, alpha 0.02493, 22888 words/s
2014-05-08 19:08:37,232 : INFO : PROGRESS: at 0.30% words, alpha 0.02493, 22926 words/s
2014-05-08 19:08:39,854 : INFO : PROGRESS: at 0.30% words, alpha 0.02493, 23022 words/s
2014-05-08 19:08:43,469 : INFO : PROGRESS: at 0.31% words, alpha 0.02493, 23044 words/s
2014-05-08 19:08:46,684 : INFO : PROGRESS: at 0.31% words, alpha 0.02493, 23081 words/s
2014-05-08 19:08:50,191 : INFO : PROGRESS: at 0.31% words, alpha 0.02493, 23073 words/s
2014-05-08 19:08:53,871 : INFO : PROGRESS: at 0.32% words, alpha 0.02492, 23055 words/s
2014-05-08 19:08:56,498 : INFO : PROGRESS: at 0.32% words, alpha 0.02492, 23130 words/s
2014-05-08 19:09:00,257 : INFO : PROGRESS: at 0.33% words, alpha 0.02492, 23121 words/s
2014-05-08 19:09:03,818 : INFO : PROGRESS: at 0.33% words, alpha 0.02492, 23130 words/s
2014-05-08 19:09:10,372 : INFO : PROGRESS: at 0.34% words, alpha 0.02492, 22937 words/s
2014-05-08 19:09:13,254 : INFO : PROGRESS: at 0.34% words, alpha 0.02492, 23024 words/s
2014-05-08 19:09:15,927 : INFO : PROGRESS: at 0.35% words, alpha 0.02492, 23111 words/s
2014-05-08 19:09:17,155 : INFO : PROGRESS: at 0.35% words, alpha 0.02492, 23293 words/s
2014-05-08 19:09:22,404 : INFO : PROGRESS: at 0.36% words, alpha 0.02492, 23148 words/s
2014-05-08 19:09:25,861 : INFO : PROGRESS: at 0.36% words, alpha 0.02491, 23155 words/s
2014-05-08 19:09:27,747 : INFO : PROGRESS: at 0.36% words, alpha 0.02491, 23266 words/s
2014-05-08 19:09:33,760 : INFO : PROGRESS: at 0.37% words, alpha 0.02491, 23265 words/s
2014-05-08 19:09:37,595 : INFO : PROGRESS: at 0.38% words, alpha 0.02491, 23224 words/s
2014-05-08 19:09:39,921 : INFO : PROGRESS: at 0.38% words, alpha 0.02491, 23313 words/s
2014-05-08 19:09:42,017 : INFO : PROGRESS: at 0.38% words, alpha 0.02491, 23433 words/s
2014-05-08 19:09:46,814 : INFO : PROGRESS: at 0.39% words, alpha 0.02491, 23340 words/s
2014-05-08 19:09:50,659 : INFO : PROGRESS: at 0.39% words, alpha 0.02491, 23319 words/s
2014-05-08 19:09:53,000 : INFO : PROGRESS: at 0.40% words, alpha 0.02491, 23410 words/s
2014-05-08 19:09:55,540 : INFO : PROGRESS: at 0.40% words, alpha 0.02490, 23483 words/s
2014-05-08 19:10:00,302 : INFO : PROGRESS: at 0.41% words, alpha 0.02490, 23403 words/s
2014-05-08 19:10:02,843 : INFO : PROGRESS: at 0.41% words, alpha 0.02490, 23471 words/s
2014-05-08 19:10:04,167 : INFO : PROGRESS: at 0.41% words, alpha 0.02490, 23606 words/s
2014-05-08 19:10:09,944 : INFO : PROGRESS: at 0.42% words, alpha 0.02490, 23453 words/s
2014-05-08 19:10:15,537 : INFO : PROGRESS: at 0.42% words, alpha 0.02490, 23341 words/s
2014-05-08 19:10:17,028 : INFO : PROGRESS: at 0.43% words, alpha 0.02490, 23482 words/s
2014-05-08 19:10:20,369 : INFO : PROGRESS: at 0.43% words, alpha 0.02490, 23487 words/s
2014-05-08 19:10:24,360 : INFO : PROGRESS: at 0.44% words, alpha 0.02490, 23468 words/s
2014-05-08 19:10:28,055 : INFO : PROGRESS: at 0.44% words, alpha 0.02489, 23463 words/s
2014-05-08 19:10:30,644 : INFO : PROGRESS: at 0.45% words, alpha 0.02489, 23534 words/s
2014-05-08 19:10:32,771 : INFO : PROGRESS: at 0.45% words, alpha 0.02489, 23618 words/s
2014-05-08 19:10:36,581 : INFO : PROGRESS: at 0.46% words, alpha 0.02489, 23588 words/s
2014-05-08 19:10:42,932 : INFO : PROGRESS: at 0.46% words, alpha 0.02489, 23433 words/s
2014-05-08 19:10:48,598 : INFO : PROGRESS: at 0.47% words, alpha 0.02489, 23516 words/s
2014-05-08 19:10:51,846 : INFO : PROGRESS: at 0.47% words, alpha 0.02489, 23542 words/s
2014-05-08 19:10:54,986 : INFO : PROGRESS: at 0.48% words, alpha 0.02488, 23552 words/s
2014-05-08 19:10:57,087 : INFO : PROGRESS: at 0.48% words, alpha 0.02488, 23643 words/s
2014-05-08 19:11:01,907 : INFO : PROGRESS: at 0.49% words, alpha 0.02488, 23566 words/s
2014-05-08 19:11:05,525 : INFO : PROGRESS: at 0.49% words, alpha 0.02488, 23568 words/s
2014-05-08 19:11:08,866 : INFO : PROGRESS: at 0.50% words, alpha 0.02488, 23587 words/s
2014-05-08 19:11:15,781 : INFO : PROGRESS: at 0.50% words, alpha 0.02488, 23585 words/s
2014-05-08 19:11:17,828 : INFO : PROGRESS: at 0.51% words, alpha 0.02488, 23662 words/s
2014-05-08 19:11:22,489 : INFO : PROGRESS: at 0.51% words, alpha 0.02488, 23578 words/s
2014-05-08 19:11:28,047 : INFO : PROGRESS: at 0.52% words, alpha 0.02487, 23648 words/s
2014-05-08 19:11:31,528 : INFO : PROGRESS: at 0.53% words, alpha 0.02487, 23659 words/s
2014-05-08 19:11:34,488 : INFO : PROGRESS: at 0.53% words, alpha 0.02487, 23666 words/s
2014-05-08 19:11:40,524 : INFO : PROGRESS: at 0.54% words, alpha 0.02487, 23697 words/s
2014-05-08 19:11:43,422 : INFO : PROGRESS: at 0.54% words, alpha 0.02487, 23710 words/s
2014-05-08 19:11:46,967 : INFO : PROGRESS: at 0.55% words, alpha 0.02487, 23697 words/s
2014-05-08 19:11:53,239 : INFO : PROGRESS: at 0.56% words, alpha 0.02487, 23738 words/s
2014-05-08 19:11:55,877 : INFO : PROGRESS: at 0.56% words, alpha 0.02486, 23782 words/s
2014-05-08 19:12:00,037 : INFO : PROGRESS: at 0.56% words, alpha 0.02486, 23752 words/s
2014-05-08 19:12:05,485 : INFO : PROGRESS: at 0.57% words, alpha 0.02486, 23837 words/s
2014-05-08 19:12:07,812 : INFO : PROGRESS: at 0.58% words, alpha 0.02486, 23888 words/s
2014-05-08 19:12:12,644 : INFO : PROGRESS: at 0.58% words, alpha 0.02486, 23819 words/s
2014-05-08 19:12:15,395 : INFO : PROGRESS: at 0.59% words, alpha 0.02486, 23856 words/s
2014-05-08 19:12:18,438 : INFO : PROGRESS: at 0.59% words, alpha 0.02486, 23845 words/s
2014-05-08 19:12:26,041 : INFO : PROGRESS: at 0.60% words, alpha 0.02485, 23818 words/s
2014-05-08 19:12:27,977 : INFO : PROGRESS: at 0.60% words, alpha 0.02485, 23887 words/s
2014-05-08 19:12:32,725 : INFO : PROGRESS: at 0.61% words, alpha 0.02485, 23837 words/s
2014-05-08 19:12:40,253 : INFO : PROGRESS: at 0.62% words, alpha 0.02485, 23818 words/s
2014-05-08 19:12:41,691 : INFO : PROGRESS: at 0.62% words, alpha 0.02485, 23926 words/s
2014-05-08 19:12:46,339 : INFO : PROGRESS: at 0.63% words, alpha 0.02485, 23873 words/s
2014-05-08 19:12:53,597 : INFO : PROGRESS: at 0.63% words, alpha 0.02485, 23861 words/s
2014-05-08 19:12:55,347 : INFO : PROGRESS: at 0.64% words, alpha 0.02484, 23949 words/s
2014-05-08 19:12:59,792 : INFO : PROGRESS: at 0.64% words, alpha 0.02484, 23901 words/s
2014-05-08 19:13:07,595 : INFO : PROGRESS: at 0.65% words, alpha 0.02484, 23875 words/s
2014-05-08 19:13:08,800 : INFO : PROGRESS: at 0.66% words, alpha 0.02484, 23984 words/s
2014-05-08 19:13:13,046 : INFO : PROGRESS: at 0.66% words, alpha 0.02484, 23951 words/s
2014-05-08 19:13:14,690 : INFO : PROGRESS: at 0.67% words, alpha 0.02484, 24044 words/s
2014-05-08 19:13:20,724 : INFO : PROGRESS: at 0.67% words, alpha 0.02484, 23929 words/s
2014-05-08 19:13:22,758 : INFO : PROGRESS: at 0.68% words, alpha 0.02484, 24007 words/s
2014-05-08 19:13:25,428 : INFO : PROGRESS: at 0.68% words, alpha 0.02483, 24034 words/s
2014-05-08 19:13:28,018 : INFO : PROGRESS: at 0.68% words, alpha 0.02483, 24075 words/s
2014-05-08 19:13:35,120 : INFO : PROGRESS: at 0.69% words, alpha 0.02483, 23920 words/s
2014-05-08 19:13:36,785 : INFO : PROGRESS: at 0.69% words, alpha 0.02483, 24012 words/s
2014-05-08 19:13:38,527 : INFO : PROGRESS: at 0.70% words, alpha 0.02483, 24090 words/s
2014-05-08 19:13:41,400 : INFO : PROGRESS: at 0.70% words, alpha 0.02483, 24119 words/s
2014-05-08 19:13:46,130 : INFO : PROGRESS: at 0.71% words, alpha 0.02483, 24050 words/s
2014-05-08 19:13:49,517 : INFO : PROGRESS: at 0.71% words, alpha 0.02483, 24054 words/s
2014-05-08 19:13:52,033 : INFO : PROGRESS: at 0.72% words, alpha 0.02483, 24105 words/s
2014-05-08 19:13:55,969 : INFO : PROGRESS: at 0.72% words, alpha 0.02482, 24087 words/s
2014-05-08 19:13:59,645 : INFO : PROGRESS: at 0.73% words, alpha 0.02482, 24080 words/s
2014-05-08 19:14:04,385 : INFO : PROGRESS: at 0.73% words, alpha 0.02482, 24033 words/s
2014-05-08 19:14:06,303 : INFO : PROGRESS: at 0.73% words, alpha 0.02482, 24096 words/s
2014-05-08 19:14:10,260 : INFO : PROGRESS: at 0.74% words, alpha 0.02482, 24080 words/s
2014-05-08 19:14:13,133 : INFO : PROGRESS: at 0.74% words, alpha 0.02482, 24102 words/s
2014-05-08 19:14:17,409 : INFO : PROGRESS: at 0.75% words, alpha 0.02482, 24078 words/s
2014-05-08 19:14:18,762 : INFO : PROGRESS: at 0.75% words, alpha 0.02482, 24138 words/s
2014-05-08 19:14:21,100 : INFO : PROGRESS: at 0.76% words, alpha 0.02481, 24314 words/s
2014-05-08 19:14:23,977 : INFO : PROGRESS: at 0.76% words, alpha 0.02481, 24256 words/s
2014-05-08 19:14:25,008 : INFO : PROGRESS: at 0.77% words, alpha 0.02481, 24392 words/s
2014-05-08 19:14:29,417 : INFO : PROGRESS: at 0.77% words, alpha 0.02481, 24263 words/s
2014-05-08 19:14:30,508 : INFO : PROGRESS: at 0.77% words, alpha 0.02481, 24329 words/s
2014-05-08 19:14:32,370 : INFO : PROGRESS: at 0.78% words, alpha 0.02481, 24329 words/s
2014-05-08 19:14:34,876 : INFO : PROGRESS: at 0.78% words, alpha 0.02481, 24272 words/s
2014-05-08 19:14:43,597 : INFO : PROGRESS: at 0.78% words, alpha 0.02481, 24161 words/s
2014-05-08 19:14:47,704 : INFO : PROGRESS: at 0.79% words, alpha 0.02481, 24137 words/s
2014-05-08 19:14:55,865 : INFO : PROGRESS: at 0.80% words, alpha 0.02480, 24224 words/s
2014-05-08 19:14:58,673 : INFO : PROGRESS: at 0.81% words, alpha 0.02480, 24231 words/s
2014-05-08 19:15:00,580 : INFO : PROGRESS: at 0.81% words, alpha 0.02480, 24295 words/s
2014-05-08 19:15:07,206 : INFO : PROGRESS: at 0.82% words, alpha 0.02480, 24307 words/s
2014-05-08 19:15:09,587 : INFO : PROGRESS: at 0.82% words, alpha 0.02480, 24337 words/s
2014-05-08 19:15:11,449 : INFO : PROGRESS: at 0.83% words, alpha 0.02480, 24473 words/s
2014-05-08 19:15:18,572 : INFO : PROGRESS: at 0.84% words, alpha 0.02480, 24327 words/s
2014-05-08 19:15:19,642 : INFO : PROGRESS: at 0.84% words, alpha 0.02479, 24496 words/s
2014-05-08 19:15:22,714 : INFO : PROGRESS: at 0.85% words, alpha 0.02479, 24529 words/s
2014-05-08 19:15:24,153 : INFO : PROGRESS: at 0.85% words, alpha 0.02479, 24548 words/s
2014-05-08 19:15:26,695 : INFO : PROGRESS: at 0.85% words, alpha 0.02479, 24567 words/s
2014-05-08 19:15:29,873 : INFO : PROGRESS: at 0.86% words, alpha 0.02479, 24577 words/s
2014-05-08 19:15:31,276 : INFO : PROGRESS: at 0.86% words, alpha 0.02479, 24567 words/s
2014-05-08 19:15:33,727 : INFO : PROGRESS: at 0.86% words, alpha 0.02479, 24601 words/s
2014-05-08 19:15:34,935 : INFO : PROGRESS: at 0.87% words, alpha 0.02479, 24713 words/s
2014-05-08 19:15:36,956 : INFO : PROGRESS: at 0.87% words, alpha 0.02478, 24695 words/s
2014-05-08 19:15:39,665 : INFO : PROGRESS: at 0.87% words, alpha 0.02478, 24649 words/s
2014-05-08 19:15:42,600 : INFO : PROGRESS: at 0.88% words, alpha 0.02478, 24640 words/s
2014-05-08 19:15:45,986 : INFO : PROGRESS: at 0.88% words, alpha 0.02478, 24650 words/s
2014-05-08 19:15:48,801 : INFO : PROGRESS: at 0.89% words, alpha 0.02478, 24637 words/s
2014-05-08 19:15:53,355 : INFO : PROGRESS: at 0.89% words, alpha 0.02478, 24544 words/s
2014-05-08 19:15:54,429 : INFO : PROGRESS: at 0.89% words, alpha 0.02478, 24619 words/s
2014-05-08 19:15:55,809 : INFO : PROGRESS: at 0.90% words, alpha 0.02478, 24691 words/s
2014-05-08 19:15:59,868 : INFO : PROGRESS: at 0.90% words, alpha 0.02478, 24647 words/s
2014-05-08 19:16:02,747 : INFO : PROGRESS: at 0.90% words, alpha 0.02478, 24630 words/s
2014-05-08 19:16:07,168 : INFO : PROGRESS: at 0.91% words, alpha 0.02478, 24590 words/s
2014-05-08 19:16:08,798 : INFO : PROGRESS: at 0.91% words, alpha 0.02478, 24651 words/s
2014-05-08 19:16:12,681 : INFO : PROGRESS: at 0.92% words, alpha 0.02477, 24634 words/s
2014-05-08 19:16:16,134 : INFO : PROGRESS: at 0.92% words, alpha 0.02477, 24687 words/s
2014-05-08 19:16:20,650 : INFO : PROGRESS: at 0.93% words, alpha 0.02477, 24638 words/s
2014-05-08 19:16:24,872 : INFO : PROGRESS: at 0.93% words, alpha 0.02477, 24603 words/s
2014-05-08 19:16:28,408 : INFO : PROGRESS: at 0.94% words, alpha 0.02477, 24709 words/s
2014-05-08 19:16:31,237 : INFO : PROGRESS: at 0.94% words, alpha 0.02477, 24711 words/s
2014-05-08 19:16:35,670 : INFO : PROGRESS: at 0.95% words, alpha 0.02477, 24656 words/s
2014-05-08 19:16:36,700 : INFO : PROGRESS: at 0.95% words, alpha 0.02477, 24726 words/s
2014-05-08 19:16:40,142 : INFO : PROGRESS: at 0.96% words, alpha 0.02476, 24713 words/s
2014-05-08 19:16:43,870 : INFO : PROGRESS: at 0.96% words, alpha 0.02476, 24694 words/s
2014-05-08 19:16:47,981 : INFO : PROGRESS: at 0.96% words, alpha 0.02476, 24667 words/s
2014-05-08 19:16:53,107 : INFO : PROGRESS: at 0.97% words, alpha 0.02476, 24721 words/s
2014-05-08 19:16:55,420 : INFO : PROGRESS: at 0.98% words, alpha 0.02476, 24752 words/s
2014-05-08 19:16:59,488 : INFO : PROGRESS: at 0.98% words, alpha 0.02476, 24718 words/s
2014-05-08 19:17:03,374 : INFO : PROGRESS: at 0.99% words, alpha 0.02476, 24797 words/s
2014-05-08 19:17:05,998 : INFO : PROGRESS: at 0.99% words, alpha 0.02476, 24807 words/s
2014-05-08 19:17:10,451 : INFO : PROGRESS: at 1.00% words, alpha 0.02475, 24757 words/s
2014-05-08 19:17:14,917 : INFO : PROGRESS: at 1.01% words, alpha 0.02475, 24806 words/s
2014-05-08 19:17:17,067 : INFO : PROGRESS: at 1.01% words, alpha 0.02475, 24828 words/s
2014-05-08 19:17:21,170 : INFO : PROGRESS: at 1.01% words, alpha 0.02475, 24786 words/s
2014-05-08 19:17:24,160 : INFO : PROGRESS: at 1.02% words, alpha 0.02475, 24798 words/s
2014-05-08 19:17:27,296 : INFO : PROGRESS: at 1.02% words, alpha 0.02475, 24803 words/s
2014-05-08 19:17:29,078 : INFO : PROGRESS: at 1.03% words, alpha 0.02475, 24849 words/s
2014-05-08 19:17:32,314 : INFO : PROGRESS: at 1.03% words, alpha 0.02475, 24841 words/s
2014-05-08 19:17:34,668 : INFO : PROGRESS: at 1.03% words, alpha 0.02475, 24865 words/s
2014-05-08 19:17:37,148 : INFO : PROGRESS: at 1.04% words, alpha 0.02474, 24877 words/s
2014-05-08 19:17:40,590 : INFO : PROGRESS: at 1.04% words, alpha 0.02474, 24878 words/s
2014-05-08 19:17:43,242 : INFO : PROGRESS: at 1.05% words, alpha 0.02474, 24895 words/s
2014-05-08 19:17:46,822 : INFO : PROGRESS: at 1.05% words, alpha 0.02474, 24889 words/s
2014-05-08 19:17:50,689 : INFO : PROGRESS: at 1.06% words, alpha 0.02474, 24947 words/s
2014-05-08 19:17:52,840 : INFO : PROGRESS: at 1.06% words, alpha 0.02474, 24961 words/s
2014-05-08 19:17:58,227 : INFO : PROGRESS: at 1.07% words, alpha 0.02474, 24890 words/s
2014-05-08 19:18:01,415 : INFO : PROGRESS: at 1.08% words, alpha 0.02474, 24996 words/s
2014-05-08 19:18:07,490 : INFO : PROGRESS: at 1.08% words, alpha 0.02473, 24968 words/s
2014-05-08 19:18:09,417 : INFO : PROGRESS: at 1.09% words, alpha 0.02473, 24999 words/s
2014-05-08 19:18:12,680 : INFO : PROGRESS: at 1.09% words, alpha 0.02473, 24995 words/s
2014-05-08 19:18:18,105 : INFO : PROGRESS: at 1.10% words, alpha 0.02473, 25027 words/s
2014-05-08 19:18:20,560 : INFO : PROGRESS: at 1.10% words, alpha 0.02473, 25052 words/s
2014-05-08 19:18:22,712 : INFO : PROGRESS: at 1.11% words, alpha 0.02473, 25076 words/s
2014-05-08 19:18:24,608 : INFO : PROGRESS: at 1.11% words, alpha 0.02473, 25117 words/s
2014-05-08 19:18:29,478 : INFO : PROGRESS: at 1.12% words, alpha 0.02473, 25064 words/s
2014-05-08 19:18:31,517 : INFO : PROGRESS: at 1.12% words, alpha 0.02472, 25093 words/s
2014-05-08 19:18:33,429 : INFO : PROGRESS: at 1.12% words, alpha 0.02472, 25125 words/s
2014-05-08 19:18:35,796 : INFO : PROGRESS: at 1.13% words, alpha 0.02472, 25146 words/s
2014-05-08 19:18:39,516 : INFO : PROGRESS: at 1.13% words, alpha 0.02472, 25112 words/s
2014-05-08 19:18:43,604 : INFO : PROGRESS: at 1.14% words, alpha 0.02472, 25085 words/s
2014-05-08 19:18:45,965 : INFO : PROGRESS: at 1.14% words, alpha 0.02472, 25113 words/s
2014-05-08 19:18:49,471 : INFO : PROGRESS: at 1.15% words, alpha 0.02472, 25180 words/s
2014-05-08 19:18:53,878 : INFO : PROGRESS: at 1.15% words, alpha 0.02472, 25144 words/s
2014-05-08 19:18:55,839 : INFO : PROGRESS: at 1.16% words, alpha 0.02471, 25181 words/s
2014-05-08 19:18:59,891 : INFO : PROGRESS: at 1.17% words, alpha 0.02471, 25256 words/s
2014-05-08 19:19:04,251 : INFO : PROGRESS: at 1.17% words, alpha 0.02471, 25225 words/s
2014-05-08 19:19:05,725 : INFO : PROGRESS: at 1.17% words, alpha 0.02471, 25269 words/s
2014-05-08 19:19:07,177 : INFO : PROGRESS: at 1.18% words, alpha 0.02471, 25319 words/s
2014-05-08 19:19:10,271 : INFO : PROGRESS: at 1.18% words, alpha 0.02471, 25326 words/s
2014-05-08 19:19:14,342 : INFO : PROGRESS: at 1.19% words, alpha 0.02471, 25304 words/s
2014-05-08 19:19:19,275 : INFO : PROGRESS: at 1.20% words, alpha 0.02470, 25399 words/s
2014-05-08 19:19:23,409 : INFO : PROGRESS: at 1.20% words, alpha 0.02470, 25361 words/s
2014-05-08 19:19:24,862 : INFO : PROGRESS: at 1.21% words, alpha 0.02470, 25457 words/s
2014-05-08 19:19:32,741 : INFO : PROGRESS: at 1.22% words, alpha 0.02470, 25394 words/s
2014-05-08 19:19:34,066 : INFO : PROGRESS: at 1.22% words, alpha 0.02470, 25432 words/s
2014-05-08 19:19:42,273 : INFO : PROGRESS: at 1.23% words, alpha 0.02470, 25453 words/s
2014-05-08 19:19:43,786 : INFO : PROGRESS: at 1.24% words, alpha 0.02469, 25487 words/s
2014-05-08 19:19:52,398 : INFO : PROGRESS: at 1.25% words, alpha 0.02469, 25492 words/s
2014-05-08 19:19:53,601 : INFO : PROGRESS: at 1.25% words, alpha 0.02469, 25534 words/s
2014-05-08 19:19:54,847 : INFO : PROGRESS: at 1.26% words, alpha 0.02469, 25662 words/s
2014-05-08 19:20:01,480 : INFO : PROGRESS: at 1.26% words, alpha 0.02469, 25539 words/s
2014-05-08 19:20:04,499 : INFO : PROGRESS: at 1.27% words, alpha 0.02469, 25537 words/s
2014-05-08 19:20:10,041 : INFO : PROGRESS: at 1.28% words, alpha 0.02468, 25613 words/s
2014-05-08 19:20:16,179 : INFO : PROGRESS: at 1.28% words, alpha 0.02468, 25527 words/s
2014-05-08 19:20:17,584 : INFO : PROGRESS: at 1.29% words, alpha 0.02468, 25654 words/s
2014-05-08 19:20:21,604 : INFO : PROGRESS: at 1.29% words, alpha 0.02468, 25624 words/s
2014-05-08 19:20:26,728 : INFO : PROGRESS: at 1.30% words, alpha 0.02468, 25557 words/s
2014-05-08 19:20:32,240 : INFO : PROGRESS: at 1.31% words, alpha 0.02468, 25649 words/s
2014-05-08 19:20:37,254 : INFO : PROGRESS: at 1.31% words, alpha 0.02467, 25593 words/s
2014-05-08 19:20:38,943 : INFO : PROGRESS: at 1.32% words, alpha 0.02467, 25714 words/s
2014-05-08 19:20:42,740 : INFO : PROGRESS: at 1.33% words, alpha 0.02467, 25696 words/s
2014-05-08 19:20:46,057 : INFO : PROGRESS: at 1.33% words, alpha 0.02467, 25693 words/s
2014-05-08 19:20:47,137 : INFO : PROGRESS: at 1.34% words, alpha 0.02467, 25751 words/s
2014-05-08 19:20:51,222 : INFO : PROGRESS: at 1.34% words, alpha 0.02467, 25801 words/s
2014-05-08 19:20:55,492 : INFO : PROGRESS: at 1.35% words, alpha 0.02467, 25773 words/s
2014-05-08 19:20:58,522 : INFO : PROGRESS: at 1.36% words, alpha 0.02466, 25927 words/s
2014-05-08 19:21:03,124 : INFO : PROGRESS: at 1.36% words, alpha 0.02466, 25850 words/s
2014-05-08 19:21:07,553 : INFO : PROGRESS: at 1.37% words, alpha 0.02466, 25816 words/s
2014-05-08 19:21:14,073 : INFO : PROGRESS: at 1.38% words, alpha 0.02466, 25871 words/s
2014-05-08 19:21:15,773 : INFO : PROGRESS: at 1.38% words, alpha 0.02466, 25886 words/s
2014-05-08 19:21:17,484 : INFO : PROGRESS: at 1.39% words, alpha 0.02466, 25909 words/s
2014-05-08 19:21:25,523 : INFO : PROGRESS: at 1.39% words, alpha 0.02466, 25848 words/s
2014-05-08 19:21:27,227 : INFO : PROGRESS: at 1.40% words, alpha 0.02465, 25877 words/s
2014-05-08 19:21:29,745 : INFO : PROGRESS: at 1.40% words, alpha 0.02465, 25894 words/s
2014-05-08 19:21:35,581 : INFO : PROGRESS: at 1.41% words, alpha 0.02465, 25906 words/s
2014-05-08 19:21:39,599 : INFO : PROGRESS: at 1.42% words, alpha 0.02465, 25947 words/s
2014-05-08 19:21:40,741 : INFO : PROGRESS: at 1.42% words, alpha 0.02465, 25996 words/s
2014-05-08 19:21:43,859 : INFO : PROGRESS: at 1.43% words, alpha 0.02465, 25968 words/s
2014-05-08 19:21:46,863 : INFO : PROGRESS: at 1.43% words, alpha 0.02465, 25968 words/s
2014-05-08 19:21:49,906 : INFO : PROGRESS: at 1.43% words, alpha 0.02465, 25970 words/s
2014-05-08 19:21:53,787 : INFO : PROGRESS: at 1.44% words, alpha 0.02464, 26022 words/s
2014-05-08 19:21:55,338 : INFO : PROGRESS: at 1.45% words, alpha 0.02464, 26051 words/s
2014-05-08 19:21:58,623 : INFO : PROGRESS: at 1.45% words, alpha 0.02464, 26046 words/s
2014-05-08 19:22:01,340 : INFO : PROGRESS: at 1.46% words, alpha 0.02464, 26121 words/s
2014-05-08 19:22:05,674 : INFO : PROGRESS: at 1.46% words, alpha 0.02464, 26099 words/s
2014-05-08 19:22:08,309 : INFO : PROGRESS: at 1.47% words, alpha 0.02464, 26100 words/s
2014-05-08 19:22:09,364 : INFO : PROGRESS: at 1.47% words, alpha 0.02464, 26144 words/s
2014-05-08 19:22:11,159 : INFO : PROGRESS: at 1.47% words, alpha 0.02464, 26170 words/s
2014-05-08 19:22:14,098 : INFO : PROGRESS: at 1.48% words, alpha 0.02463, 26173 words/s
2014-05-08 19:22:16,400 : INFO : PROGRESS: at 1.48% words, alpha 0.02463, 26185 words/s
2014-05-08 19:22:19,704 : INFO : PROGRESS: at 1.49% words, alpha 0.02463, 26251 words/s
2014-05-08 19:22:21,041 : INFO : PROGRESS: at 1.49% words, alpha 0.02463, 26268 words/s
2014-05-08 19:22:25,885 : INFO : PROGRESS: at 1.50% words, alpha 0.02463, 26214 words/s
2014-05-08 19:22:29,865 : INFO : PROGRESS: at 1.51% words, alpha 0.02463, 26257 words/s
2014-05-08 19:22:34,638 : INFO : PROGRESS: at 1.51% words, alpha 0.02462, 26262 words/s
2014-05-08 19:22:36,791 : INFO : PROGRESS: at 1.52% words, alpha 0.02462, 26323 words/s
2014-05-08 19:22:43,909 : INFO : PROGRESS: at 1.53% words, alpha 0.02462, 26261 words/s
2014-05-08 19:22:45,979 : INFO : PROGRESS: at 1.53% words, alpha 0.02462, 26337 words/s
2014-05-08 19:22:47,508 : INFO : PROGRESS: at 1.54% words, alpha 0.02462, 26374 words/s
2014-05-08 19:22:51,592 : INFO : PROGRESS: at 1.54% words, alpha 0.02462, 26339 words/s
2014-05-08 19:22:53,824 : INFO : PROGRESS: at 1.55% words, alpha 0.02462, 26402 words/s
2014-05-08 19:23:00,723 : INFO : PROGRESS: at 1.56% words, alpha 0.02461, 26363 words/s
2014-05-08 19:23:02,383 : INFO : PROGRESS: at 1.56% words, alpha 0.02461, 26379 words/s
2014-05-08 19:23:08,200 : INFO : PROGRESS: at 1.57% words, alpha 0.02461, 26416 words/s
2014-05-08 19:23:10,309 : INFO : PROGRESS: at 1.58% words, alpha 0.02461, 26540 words/s
2014-05-08 19:23:14,531 : INFO : PROGRESS: at 1.58% words, alpha 0.02461, 26490 words/s
2014-05-08 19:23:16,547 : INFO : PROGRESS: at 1.59% words, alpha 0.02461, 26552 words/s
2014-05-08 19:23:18,931 : INFO : PROGRESS: at 1.59% words, alpha 0.02461, 26556 words/s
2014-05-08 19:23:22,181 : INFO : PROGRESS: at 1.60% words, alpha 0.02460, 26532 words/s
2014-05-08 19:23:23,199 : INFO : PROGRESS: at 1.60% words, alpha 0.02460, 26555 words/s
2014-05-08 19:23:24,558 : INFO : PROGRESS: at 1.60% words, alpha 0.02460, 26625 words/s
2014-05-08 19:23:28,282 : INFO : PROGRESS: at 1.61% words, alpha 0.02460, 26579 words/s
2014-05-08 19:23:29,616 : INFO : PROGRESS: at 1.61% words, alpha 0.02460, 26589 words/s
2014-05-08 19:23:31,703 : INFO : PROGRESS: at 1.62% words, alpha 0.02460, 26637 words/s
2014-05-08 19:23:36,269 : INFO : PROGRESS: at 1.62% words, alpha 0.02460, 26577 words/s
2014-05-08 19:23:38,159 : INFO : PROGRESS: at 1.62% words, alpha 0.02460, 26583 words/s
2014-05-08 19:23:39,315 : INFO : PROGRESS: at 1.62% words, alpha 0.02460, 26607 words/s
2014-05-08 19:23:42,049 : INFO : PROGRESS: at 1.63% words, alpha 0.02460, 26604 words/s
2014-05-08 19:23:46,112 : INFO : PROGRESS: at 1.63% words, alpha 0.02460, 26570 words/s
2014-05-08 19:23:47,973 : INFO : PROGRESS: at 1.64% words, alpha 0.02459, 26587 words/s
2014-05-08 19:23:51,885 : INFO : PROGRESS: at 1.64% words, alpha 0.02459, 26621 words/s
2014-05-08 19:23:54,250 : INFO : PROGRESS: at 1.65% words, alpha 0.02459, 26621 words/s
2014-05-08 19:23:55,728 : INFO : PROGRESS: at 1.65% words, alpha 0.02459, 26644 words/s
2014-05-08 19:23:57,307 : INFO : PROGRESS: at 1.65% words, alpha 0.02459, 26669 words/s
2014-05-08 19:23:59,348 : INFO : PROGRESS: at 1.66% words, alpha 0.02459, 26682 words/s
2014-05-08 19:24:01,052 : INFO : PROGRESS: at 1.66% words, alpha 0.02459, 26711 words/s
2014-05-08 19:24:02,940 : INFO : PROGRESS: at 1.67% words, alpha 0.02459, 26731 words/s
2014-05-08 19:24:03,964 : INFO : PROGRESS: at 1.67% words, alpha 0.02459, 26767 words/s
2014-05-08 19:24:06,175 : INFO : PROGRESS: at 1.67% words, alpha 0.02459, 26767 words/s
2014-05-08 19:24:08,628 : INFO : PROGRESS: at 1.68% words, alpha 0.02458, 26761 words/s
2014-05-08 19:24:12,882 : INFO : PROGRESS: at 1.68% words, alpha 0.02458, 26771 words/s
2014-05-08 19:24:15,387 : INFO : PROGRESS: at 1.69% words, alpha 0.02458, 26784 words/s
2014-05-08 19:24:17,037 : INFO : PROGRESS: at 1.69% words, alpha 0.02458, 26806 words/s
2014-05-08 19:24:18,200 : INFO : PROGRESS: at 1.69% words, alpha 0.02458, 26852 words/s
2014-05-08 19:24:20,100 : INFO : PROGRESS: at 1.70% words, alpha 0.02458, 26868 words/s
2014-05-08 19:24:22,580 : INFO : PROGRESS: at 1.70% words, alpha 0.02458, 26874 words/s
2014-05-08 19:24:25,177 : INFO : PROGRESS: at 1.71% words, alpha 0.02458, 26922 words/s
2014-05-08 19:24:29,390 : INFO : PROGRESS: at 1.71% words, alpha 0.02458, 26901 words/s
2014-05-08 19:24:31,897 : INFO : PROGRESS: at 1.72% words, alpha 0.02457, 26911 words/s
2014-05-08 19:24:34,396 : INFO : PROGRESS: at 1.73% words, alpha 0.02457, 26980 words/s
2014-05-08 19:24:36,777 : INFO : PROGRESS: at 1.73% words, alpha 0.02457, 26984 words/s
2014-05-08 19:24:38,450 : INFO : PROGRESS: at 1.73% words, alpha 0.02457, 27003 words/s
2014-05-08 19:24:39,476 : INFO : PROGRESS: at 1.74% words, alpha 0.02457, 27040 words/s
2014-05-08 19:24:41,266 : INFO : PROGRESS: at 1.74% words, alpha 0.02457, 27057 words/s
2014-05-08 19:24:43,694 : INFO : PROGRESS: at 1.74% words, alpha 0.02457, 27044 words/s
2014-05-08 19:24:44,712 : INFO : PROGRESS: at 1.75% words, alpha 0.02457, 27076 words/s
2014-05-08 19:24:47,379 : INFO : PROGRESS: at 1.75% words, alpha 0.02457, 27079 words/s
2014-05-08 19:24:49,109 : INFO : PROGRESS: at 1.76% words, alpha 0.02456, 27109 words/s
2014-05-08 19:24:51,222 : INFO : PROGRESS: at 1.76% words, alpha 0.02456, 27125 words/s
2014-05-08 19:24:54,051 : INFO : PROGRESS: at 1.77% words, alpha 0.02456, 27192 words/s
2014-05-08 19:24:57,740 : INFO : PROGRESS: at 1.78% words, alpha 0.02456, 27216 words/s
2014-05-08 19:25:00,660 : INFO : PROGRESS: at 1.78% words, alpha 0.02456, 27263 words/s
2014-05-08 19:25:05,037 : INFO : PROGRESS: at 1.79% words, alpha 0.02456, 27271 words/s
2014-05-08 19:25:06,906 : INFO : PROGRESS: at 1.80% words, alpha 0.02455, 27333 words/s
2014-05-08 19:25:10,981 : INFO : PROGRESS: at 1.80% words, alpha 0.02455, 27310 words/s
2014-05-08 19:25:13,006 : INFO : PROGRESS: at 1.80% words, alpha 0.02455, 27307 words/s
2014-05-08 19:25:14,872 : INFO : PROGRESS: at 1.81% words, alpha 0.02455, 27305 words/s
2014-05-08 19:25:18,687 : INFO : PROGRESS: at 1.81% words, alpha 0.02455, 27333 words/s
2014-05-08 19:25:20,941 : INFO : PROGRESS: at 1.82% words, alpha 0.02455, 27339 words/s
2014-05-08 19:25:22,422 : INFO : PROGRESS: at 1.82% words, alpha 0.02455, 27354 words/s
2014-05-08 19:25:23,642 : INFO : PROGRESS: at 1.82% words, alpha 0.02455, 27383 words/s
2014-05-08 19:25:27,694 : INFO : PROGRESS: at 1.83% words, alpha 0.02455, 27353 words/s
2014-05-08 19:25:29,634 : INFO : PROGRESS: at 1.83% words, alpha 0.02454, 27365 words/s
2014-05-08 19:25:35,437 : INFO : PROGRESS: at 1.84% words, alpha 0.02454, 27415 words/s
2014-05-08 19:25:37,554 : INFO : PROGRESS: at 1.85% words, alpha 0.02454, 27425 words/s
2014-05-08 19:25:43,834 : INFO : PROGRESS: at 1.86% words, alpha 0.02454, 27447 words/s
2014-05-08 19:25:46,373 : INFO : PROGRESS: at 1.86% words, alpha 0.02454, 27446 words/s
2014-05-08 19:25:51,103 : INFO : PROGRESS: at 1.87% words, alpha 0.02453, 27486 words/s
2014-05-08 19:25:53,635 : INFO : PROGRESS: at 1.88% words, alpha 0.02453, 27540 words/s
2014-05-08 19:25:57,305 : INFO : PROGRESS: at 1.89% words, alpha 0.02453, 27561 words/s
2014-05-08 19:25:59,398 : INFO : PROGRESS: at 1.89% words, alpha 0.02453, 27601 words/s
2014-05-08 19:26:02,040 : INFO : PROGRESS: at 1.90% words, alpha 0.02453, 27604 words/s
2014-05-08 19:26:03,637 : INFO : PROGRESS: at 1.90% words, alpha 0.02453, 27603 words/s
2014-05-08 19:26:04,690 : INFO : PROGRESS: at 1.90% words, alpha 0.02453, 27636 words/s
2014-05-08 19:26:06,460 : INFO : PROGRESS: at 1.90% words, alpha 0.02453, 27629 words/s
2014-05-08 19:26:10,798 : INFO : PROGRESS: at 1.91% words, alpha 0.02452, 27643 words/s
2014-05-08 19:26:12,199 : INFO : PROGRESS: at 1.91% words, alpha 0.02453, 27667 words/s
2014-05-08 19:26:16,708 : INFO : PROGRESS: at 1.92% words, alpha 0.02452, 27687 words/s
2014-05-08 19:26:17,949 : INFO : PROGRESS: at 1.93% words, alpha 0.02452, 27753 words/s
2014-05-08 19:26:20,513 : INFO : PROGRESS: at 1.93% words, alpha 0.02452, 27779 words/s
2014-05-08 19:26:23,028 : INFO : PROGRESS: at 1.94% words, alpha 0.02452, 27770 words/s
2014-05-08 19:26:27,017 : INFO : PROGRESS: at 1.94% words, alpha 0.02452, 27783 words/s
2014-05-08 19:26:29,169 : INFO : PROGRESS: at 1.95% words, alpha 0.02452, 27790 words/s
2014-05-08 19:26:32,675 : INFO : PROGRESS: at 1.96% words, alpha 0.02452, 27814 words/s
2014-05-08 19:26:35,459 : INFO : PROGRESS: at 1.96% words, alpha 0.02451, 27829 words/s
2014-05-08 19:26:38,542 : INFO : PROGRESS: at 1.97% words, alpha 0.02451, 27836 words/s
2014-05-08 19:26:41,726 : INFO : PROGRESS: at 1.97% words, alpha 0.02451, 27838 words/s
2014-05-08 19:26:43,919 : INFO : PROGRESS: at 1.97% words, alpha 0.02451, 27842 words/s
2014-05-08 19:26:45,881 : INFO : PROGRESS: at 1.98% words, alpha 0.02451, 27860 words/s
2014-05-08 19:26:46,979 : INFO : PROGRESS: at 1.98% words, alpha 0.02451, 27896 words/s
2014-05-08 19:26:49,368 : INFO : PROGRESS: at 1.99% words, alpha 0.02451, 27917 words/s
2014-05-08 19:26:51,198 : INFO : PROGRESS: at 1.99% words, alpha 0.02451, 27942 words/s
2014-05-08 19:26:52,606 : INFO : PROGRESS: at 2.00% words, alpha 0.02451, 27970 words/s
2014-05-08 19:26:53,978 : INFO : PROGRESS: at 2.00% words, alpha 0.02450, 28002 words/s
2014-05-08 19:26:56,308 : INFO : PROGRESS: at 2.00% words, alpha 0.02450, 28017 words/s
2014-05-08 19:26:58,149 : INFO : PROGRESS: at 2.01% words, alpha 0.02450, 28042 words/s
2014-05-08 19:26:59,179 : INFO : PROGRESS: at 2.01% words, alpha 0.02450, 28079 words/s
2014-05-08 19:27:02,118 : INFO : PROGRESS: at 2.02% words, alpha 0.02450, 28113 words/s
2014-05-08 19:27:03,503 : INFO : PROGRESS: at 2.02% words, alpha 0.02450, 28130 words/s
2014-05-08 19:27:04,649 : INFO : PROGRESS: at 2.03% words, alpha 0.02450, 28148 words/s
2014-05-08 19:27:06,425 : INFO : PROGRESS: at 2.03% words, alpha 0.02450, 28168 words/s
2014-05-08 19:27:08,008 : INFO : PROGRESS: at 2.03% words, alpha 0.02449, 28179 words/s
2014-05-08 19:27:09,166 : INFO : PROGRESS: at 2.04% words, alpha 0.02449, 28192 words/s
2014-05-08 19:27:11,372 : INFO : PROGRESS: at 2.04% words, alpha 0.02449, 28226 words/s
2014-05-08 19:27:14,197 : INFO : PROGRESS: at 2.05% words, alpha 0.02449, 28240 words/s
2014-05-08 19:27:15,303 : INFO : PROGRESS: at 2.05% words, alpha 0.02449, 28264 words/s
2014-05-08 19:27:16,477 : INFO : PROGRESS: at 2.05% words, alpha 0.02449, 28277 words/s
2014-05-08 19:27:17,889 : INFO : PROGRESS: at 2.06% words, alpha 0.02449, 28286 words/s
2014-05-08 19:27:20,066 : INFO : PROGRESS: at 2.06% words, alpha 0.02449, 28280 words/s
2014-05-08 19:27:22,361 : INFO : PROGRESS: at 2.06% words, alpha 0.02449, 28309 words/s
2014-05-08 19:27:23,617 : INFO : PROGRESS: at 2.07% words, alpha 0.02449, 28327 words/s
2014-05-08 19:27:24,647 : INFO : PROGRESS: at 2.07% words, alpha 0.02449, 28342 words/s
2014-05-08 19:27:25,853 : INFO : PROGRESS: at 2.07% words, alpha 0.02448, 28352 words/s
2014-05-08 19:27:26,933 : INFO : PROGRESS: at 2.08% words, alpha 0.02448, 28362 words/s
2014-05-08 19:27:28,773 : INFO : PROGRESS: at 2.08% words, alpha 0.02448, 28360 words/s
2014-05-08 19:27:30,069 : INFO : PROGRESS: at 2.08% words, alpha 0.02448, 28377 words/s
2014-05-08 19:27:31,252 : INFO : PROGRESS: at 2.08% words, alpha 0.02448, 28388 words/s
2014-05-08 19:27:33,800 : INFO : PROGRESS: at 2.09% words, alpha 0.02448, 28422 words/s
2014-05-08 19:27:35,639 : INFO : PROGRESS: at 2.09% words, alpha 0.02448, 28416 words/s
2014-05-08 19:27:36,882 : INFO : PROGRESS: at 2.10% words, alpha 0.02448, 28432 words/s
2014-05-08 19:27:38,729 : INFO : PROGRESS: at 2.10% words, alpha 0.02448, 28449 words/s

h3im...@gmail.com

May 9, 2014, 7:08:08 AM
to gen...@googlegroups.com
Here's an update:
the training process stopped in a strange way, at 71%:

2014-05-09 10:54:33,426 : INFO : PROGRESS: at 71.02% words, alpha 0.00725, 23360 words/s
2014-05-09 10:54:34,977 : INFO : PROGRESS: at 71.02% words, alpha 0.00725, 23360 words/s
2014-05-09 10:54:36,462 : INFO : PROGRESS: at 71.03% words, alpha 0.00724, 23361 words/s
2014-05-09 10:54:37,985 : INFO : PROGRESS: at 71.03% words, alpha 0.00724, 23361 words/s
2014-05-09 10:54:43,415 : INFO : finished iterating over Wikipedia corpus of 3529460 documents with 1883635185 positions (total 14313024 articles, 1937483820 positions before pruning articles shorter than 50 words)
2014-05-09 10:54:43,442 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-05-09 10:54:43,786 : INFO : PROGRESS: at 71.04% words, alpha 0.00724, 23361 words/s
2014-05-09 10:54:43,786 : INFO : training on 1331758609 words took 57008.9s, 23361 words/s
2014-05-09 10:54:43,828 : INFO : precomputing L2-norms of word weight vectors
2014-05-09 10:56:11,200 : INFO : saving Word2Vec object under word2vec_model, separately None
2014-05-09 10:56:11,256 : INFO : not storing attribute syn0norm
2014-05-09 10:56:11,256 : INFO : storing numpy array 'syn0' to word2vec_model.syn0.npy

Anyway, the results of the examples from the gensim word2vec page are more sensible than before:

>>> model = gensim.models.Word2Vec.load("word2vec_model")

>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.38105958700180054), ('nubkhesbed', 0.3764060139656067), ('nasalsa', 0.36744004487991333), ('analmaye', 0.3649202883243561), ('nasakhma', 0.3638891577720642), ('teuhe', 0.36264991760253906), ('naparaye', 0.35824206471443176), ('debsirindra', 0.35784679651260376), ('tabekenamun', 0.3571649193763733), ('tabiry', 0.35676461458206177)]

>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

>>> model.similarity('woman', 'man')
0.45423352136576706
>>> model.similarity('woman', 'woman')
1.0000000000000002
>>> model.similarity('woman', 'girl')
0.39623567721532726

All the entries in the most_similar output, even if they look like noise, are names of queens or kings.

Still, isn't it strange that the training stopped at 71%?

Now I will extract the text from the articles, as suggested by Radim, and try to train on it. Let's see if this time the training completes at 100%.
...

h3im...@gmail.com

May 9, 2014, 3:48:04 PM
to gen...@googlegroups.com
Hello again.
I collected all the texts and launched the process on them with LineSentence.
It took a lot less time (about 5.5 hours).
However, it again finished the job at 71%, with this output:

2014-05-09 19:15:39,266 : INFO : PROGRESS: at 71.02% words, alpha 0.00725, 70668 words/s
2014-05-09 19:15:39,271 : INFO : reached the end of input; waiting to finish 8 outstanding jobs
2014-05-09 19:15:40,781 : INFO : PROGRESS: at 71.03% words, alpha 0.00724, 70667 words/s
2014-05-09 19:15:42,190 : INFO : PROGRESS: at 71.03% words, alpha 0.00724, 70668 words/s
2014-05-09 19:15:42,796 : INFO : training on 1331758609 words took 18844.6s, 70671 words/s
2014-05-09 19:15:42,796 : INFO : precomputing L2-norms of word weight vectors
2014-05-09 19:16:03,703 : INFO : saving Word2Vec object under word2vec_model, separately None
2014-05-09 19:16:03,703 : INFO : not storing attribute syn0norm
2014-05-09 19:16:03,703 : INFO : storing numpy array 'syn0' to word2vec_model.syn0.npy

What could be the reason for this? Is it ok anyway?
Thanks

Brent Payne

May 10, 2014, 2:55:33 PM
to gen...@googlegroups.com
That is weird. I looked at the code and could not readily see why this would happen. What version are you using? My first guess was that the percentage was using the total vocab size and not the vocab size after filtering out words under the minimum count, but this is not the case in my version, 0.9.1.



h3im...@gmail.com

May 16, 2014, 1:26:02 PM
to gen...@googlegroups.com
Hi Brent,
I was using 0.9.0, so I updated to 0.9.1 and ran the script again, with the same result:

2014-05-16 18:44:27,775 : INFO : PROGRESS: at 71.02% words, alpha 0.00725, 70593 words/s
2014-05-16 18:44:29,607 : INFO : PROGRESS: at 71.03% words, alpha 0.00725, 70589 words/s
2014-05-16 18:44:30,356 : INFO : training on 1331758609 words took 18864.2s, 70597 words/s
2014-05-16 18:44:30,357 : INFO : precomputing L2-norms of word weight vectors
2014-05-16 18:44:50,846 : INFO : saving Word2Vec object under word2vec_model, separately None
2014-05-16 18:44:50,846 : INFO : not storing attribute syn0norm
2014-05-16 18:44:50,846 : INFO : storing numpy array 'syn0' to word2vec_model.syn0.npy

If you want to reproduce the issue, you can download the Wikipedia articles dump from here:
http://dumps.wikimedia.org/enwiki/20140304/

Then use the following script to extract the raw text:


import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        # the script has no module docstring, so print an explicit usage line
        print("usage: %s WIKI_DUMP OUTPUT_TEXT" % program)
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    i = 0

    # passing a dictionary avoids the extra pass over the dump that builds one
    wiki = WikiCorpus(inp, dictionary={})

    # write one article per line, tokens separated by single spaces
    with open(outp, 'w') as output:
        for text in wiki.get_texts():
            output.write(" ".join(text) + "\n")
            i += 1
            if i % 10000 == 0:
                logger.info("Saved %d articles", i)

    logger.info("Finished. Saved %d articles", i)


Finally, run the script that generates the word2vec model:



import logging
import multiprocessing
import os.path
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        # the script has no module docstring, so print an explicit usage line
        print("usage: %s INPUT_TEXT OUTPUT_MODEL" % program)
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    # LineSentence treats every line of the input file as one sentence
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)

    model.save(outp)

Radim Řehůřek

May 20, 2014, 4:28:10 AM
to gen...@googlegroups.com
Hi,

did all your CPUs kick in, after you switched to `LineSentence` input?

Finishing training at 71% is strange indeed. I can't think of a reason for this, except maybe that your input iterator changes between two runs (unlikely, since it reads from a single file on disk).

We'll need to debug in more detail. h3im, can you try this and post your results:

>>> from six import itervalues
>>> print sum(v.count for v in itervalues(model.vocab))
>>> print sum(sum(1 for word in sentence if word in model.vocab) for sentence in LineSentence(inp))

The two numbers should match.
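
To make the purpose of this check concrete, here is a minimal, gensim-free sketch of the same comparison on a toy corpus. The names `build_vocab_counts` and `count_in_vocab_tokens` and the toy data are made up for illustration; they are not gensim functions. The first number is the sum of the stored counts of the words that survive `min_count`, the second is a fresh recount over the input, and they can only agree if the input yields the same tokens on every pass.

```python
from collections import Counter


def build_vocab_counts(sentences, min_count):
    """Count every token, then keep only words seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


def count_in_vocab_tokens(sentences, vocab):
    """Recount, in a fresh pass, the tokens that are present in the vocabulary."""
    return sum(1 for sentence in sentences for word in sentence if word in vocab)


# Toy corpus (hypothetical data): a list can be iterated any number of times.
corpus = [
    ["the", "king", "and", "the", "queen"],
    ["the", "queen", "and", "the", "woman"],
]

vocab = build_vocab_counts(corpus, min_count=2)
stored_total = sum(vocab.values())
recounted_total = count_in_vocab_tokens(corpus, vocab)
print(stored_total, recounted_total)
```

If the input were a one-shot generator instead of a list, the first pass would exhaust it and the recount would come out as 0, which is the kind of mismatch this check is meant to reveal.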

Cheers,
Radim

h3im...@gmail.com

May 24, 2014, 4:03:33 PM
to gen...@googlegroups.com
Hi Radim,

here are the final lines of the output:

2014-05-24 21:45:09,526 : INFO : PROGRESS: at 71.02% words, alpha 0.00725, 70742 words/s
2014-05-24 21:45:11,492 : INFO : PROGRESS: at 71.03% words, alpha 0.00725, 70737 words/s
2014-05-24 21:45:12,124 : INFO : training on 1331758609 words took 18824.7s, 70745 words/s
2014-05-24 21:45:12,125 : INFO : precomputing L2-norms of word weight vectors
1874738876
1874738876
2014-05-24 21:52:14,588 : INFO : saving Word2Vec object under word2vec_model_2, separately None
2014-05-24 21:52:14,588 : INFO : not storing attribute syn0norm
2014-05-24 21:52:14,588 : INFO : storing numpy array 'syn0' to word2vec_model_2.syn0.npy


The two numbers are the ones you asked me to check, and they are identical.

To obtain them I added these lines to my code:


    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)

    print sum(v.count for v in model.vocab.itervalues())

    print sum(sum(1 for word in sentence if word in model.vocab) for sentence in LineSentence(inp))

    model.save(outp)

If you want me to do further tests please tell me, or if you want I can send you all the scripts and the texts. The text is 11 GB uncompressed and I don't know how to send it to you easily, but if you want me to, just tell me ;)

Radim Řehůřek

May 25, 2014, 6:39:42 AM5/25/14
to gen...@googlegroups.com
Thanks h3im.

Both numbers are identical, so there's no problem with the dictionary/input.

I had another idea -- inside the cython code, the maximum sentence length is clipped to 1,000 words. Any sentence longer than that will only consider the first 1,000 words.

In your case, you're storing entire documents as a single sentence (1 wiki doc = 1 sentence). So this restriction may be kicking in.
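
A small sketch of the arithmetic behind this suspicion. The document lengths are hypothetical and `clipped_word_count` is an illustrative helper, not part of gensim; it only assumes that everything past the first `max_sentence_len` words of a "sentence" is ignored.

```python
def clipped_word_count(doc_lengths, max_sentence_len=1000):
    """Words actually seen if each document is cut off at max_sentence_len."""
    return sum(min(n, max_sentence_len) for n in doc_lengths)


# Hypothetical document lengths in words, for illustration only.
doc_lengths = [300, 1000, 2500, 12000]
seen = clipped_word_count(doc_lengths)
total = sum(doc_lengths)
print("trained on %d of %d words (%.1f%%)" % (seen, total, 100.0 * seen / total))

# The two figures from the log earlier in this thread:
trained, expected = 1331758609, 1874738876
print("%.2f%% of in-vocabulary words trained" % (100.0 * trained / expected))
```

The ratio of the logged figures comes out at roughly 71%, in line with the last PROGRESS line, which is what one would expect if long documents were being truncated.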

Can you try increasing `DEF MAX_SENTENCE_LEN = 1000` to 10k for example, in word2vec_inner.pyx?

Or, alternatively, split documents into sentences, so each sentence is < 1,000 words long.
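
A minimal sketch of the second option, assuming each document is already a list of tokens. Note that it cuts documents into fixed-size pieces rather than doing real sentence splitting, and `ChunkedSentences` is a made-up name. It is written as a class with `__iter__` so that it can be iterated more than once, as long as the underlying input can.

```python
class ChunkedSentences(object):
    """Re-iterable wrapper that yields pieces of at most max_len tokens."""

    def __init__(self, sentences, max_len=1000):
        self.sentences = sentences  # any re-iterable of token lists
        self.max_len = max_len

    def __iter__(self):
        for sentence in self.sentences:
            # Walk over the token list in steps of max_len.
            for start in range(0, len(sentence), self.max_len):
                yield sentence[start:start + self.max_len]


# Toy input: one 2,500-token "document" and one short one.
docs = [["w%d" % i for i in range(2500)], ["short", "doc"]]
chunked = ChunkedSentences(docs, max_len=1000)

first_pass = [len(chunk) for chunk in chunked]
second_pass = [len(chunk) for chunk in chunked]
print(first_pass)
```

Fixed-size chunks lose some context at the cut points, so splitting on actual sentence boundaries is preferable when a sentence tokenizer is available.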

Let me know,
Radim


--
Radim Řehůřek, Ph.D.
consultant @ machine learning, natural language processing, data mining
skype "radimrehurek"

Ron

Jul 15, 2014, 1:45:13 PM7/15/14
to gen...@googlegroups.com
Hey, I'd like to see how Word2Vec works on my dataset. Since I don't have much memory (4GB RAM) and only one CPU, I think it will be impossible for me to train Word2Vec on WikiCorpus. Do you know if someone has posted their saved model online?

Radim Řehůřek

Jul 15, 2014, 4:25:06 PM7/15/14
to gen...@googlegroups.com
Hi Ron,

On Tuesday, July 15, 2014 8:45:13 PM UTC+3, Ron wrote:
Hey, I'd like to see how Word2Vec works on my dataset. Since I don't have much memory (4GB RAM) and only one CPU, I think it will be impossible for me to train Word2Vec on WikiCorpus. Do you know if someone has posted their saved model online?

with 4GB RAM, you may have to use a smaller vocab (maybe try ~100k words, and 300 layer size).

Otherwise nothing's stopping you from running word2vec. Make sure you have a fast BLAS installed for scipy/numpy, and you'll be fine even with a single CPU -- word2vec is fast.
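
The memory advice above can be sanity-checked with a back-of-the-envelope estimate. This sketch assumes 4-byte floats and two weight matrices of shape vocabulary x layer size (the word vectors plus one output layer); both are assumptions, and it ignores the vocabulary dictionary and any other overhead. `estimate_matrix_bytes` is an illustrative helper, not a gensim function.

```python
def estimate_matrix_bytes(vocab_size, layer_size, bytes_per_float=4, num_matrices=2):
    """Rough size of the weight matrices alone, in bytes."""
    return vocab_size * layer_size * bytes_per_float * num_matrices


# The settings suggested above: ~100k words, 300 dimensions.
small = estimate_matrix_bytes(100000, 300)
print("%.0f MB" % (small / 1e6))

# A hypothetical 2M-word vocabulary at 400 dimensions, for comparison.
large = estimate_matrix_bytes(2000000, 400)
print("%.1f GB" % (large / 1e9))
```

Under these assumptions the suggested settings need a few hundred MB for the matrices, while a multi-million-word vocabulary would not fit comfortably in 4GB.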

I don't know anyone who uploaded their trained Wiki model online, sorry.

HTH,
Radim

sndr....@gmail.com

Mar 8, 2017, 6:30:43 PM3/8/17
to gensim
Radim

Hopefully by now, after 2 years, you know somebody who has made a trained Wikipedia model available?

Lev Konstantinovskiy

Mar 8, 2017, 6:55:33 PM3/8/17
to gensim
Hi,

I'm not aware of a word2vec model trained on Wikipedia that is available online, but there is something better: fastText vectors trained on Wikipedia in 90 languages. Load them using the gensim fastText wrapper.
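
For readers who only want to peek at such pre-trained vectors: the `.vec` files are plain text in the word2vec text format (a header line with vocabulary size and dimensionality, then one word and its values per line). The sketch below reads that format in plain Python instead of going through the gensim wrapper; `read_text_vectors` and `cosine` are illustrative names, and the three-word sample stands in for a real file.

```python
import io
import math


def read_text_vectors(handle, limit=None):
    """Read word2vec-style text vectors from an open text handle into a dict."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimensions>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines and tokens containing spaces
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # stop early to keep memory use small
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Tiny made-up sample in the same layout as a real file.
sample = io.StringIO(u"3 2\nking 1.0 0.0\nqueen 0.8 0.6\nman 0.0 1.0\n")
vectors = read_text_vectors(sample)
print(cosine(vectors["king"], vectors["queen"]))
```

For a real file, pass `io.open(path, encoding="utf-8")` and a `limit` so that only the first, most frequent words are loaded.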

Andrey Kutuzov

Mar 8, 2017, 9:09:18 PM3/8/17
to gen...@googlegroups.com
Hi Lev,

I would not say that fastText models are necessarily better than
word2vec ones. The fastText English model from the release you mentioned
achieves 0.38 on the SimLex999 test set and 0.83 on the Google Analogy
test set (using only semantic sections).

At the same time, quite an ordinary word2vec model
(http://ltr.uio.no/semvec/en/about#models) trained on the same English
Wikipedia achieves 0.44 on SimLex999 and 0.77 on Google Analogy. It is
at least comparable performance, but note that the fastText model is 10
Gbytes in size, while the word2vec one is 300 Mbytes, 30 times smaller.

Of course, for some problems fastText models can be significantly
better, but this should be evaluated for each particular task. I am not
sure that achieving 2 more decimal points on the Google Analogy test set
is worth increasing the model size 30 times :)

P.S. To the initial poster: yes, you can download the trained Wikipedia
model from the URL I gave above.



09.03.2017 00:55, Lev Konstantinovskiy wrote:
> Hi,
>
> Not aware of a word2vec on wikipedia online but there is something
> better - fasttext vectors
> <https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md>
> trained on wikipedia in 90 languages. Load them using gensim fasttext
> wrapper
> <https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb>.

--
Solve et coagula!
Andrey

Lev Konstantinovskiy

Mar 9, 2017, 8:26:20 AM3/9/17
to gensim
Andrey, 

Thanks for the clarification! Agree with you - I wasn't aware of that eval.

By the way, for people revisiting this post: there are links to many pre-trained models in this GitHub ticket, which will soon become a page.

Regards
Lev