LDA model based on non-English Wikipedia

pawel.g.d...@gmail.com

Jan 4, 2016, 11:55:55 AM
to gensim
First of all, thank you for making the amazing gensim tool!

I am looking to create an LDA model based on the German (or another language) Wikipedia. So far, I have forked gensim, plugged in my own stopwords instead of the English ones, and replaced the standard stemmer with an NLTK German one. I am not getting very good results with topic modelling of German texts using the model created this way. I have a similarly created English model, and it works flawlessly on English texts.
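
Roughly, that substitution looks like the following (a minimal sketch, not the actual fork; the tokenizer and the exact filtering rules here are assumptions):

# Sketch only: plug German stopwords and an NLTK German stemmer into the
# preprocessing step. Requires nltk.download('stopwords') beforehand.
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

GERMAN_STOPWORDS = set(stopwords.words("german"))
stemmer = SnowballStemmer("german")

def preprocess_de(text):
    # Tokenize and lowercase, drop German stopwords, stem what remains.
    tokens = simple_preprocess(text, deacc=False)  # keep umlauts intact
    return [stemmer.stem(tok) for tok in tokens if tok not in GERMAN_STOPWORDS]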

Do you have any suggestions as to what else I might change in the model creation process? In other words, are there any other parts of gensim (besides the stopwords and stemmer) that could be language-dependent?

Cheers and happy new year!
Pawel

Christopher S. Corley

Jan 4, 2016, 1:34:26 PM
to gensim
It should be fine to use no matter which language, barring Unicode issues during your preprocessing steps. Is there a large difference in the sizes of your corpora, in terms of the number of documents, number of unique terms, or mean terms per document? Those are all things that could be detrimental to your model quality. For smaller corpora, I find it useful to set the `passes` parameter to 10-20.
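
For example, something like this (a minimal sketch on a tiny toy corpus; in practice the corpus and dictionary would come from your wiki preprocessing, and num_topics would be much larger):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-in corpus, only to show where `passes` goes.
texts = [["human", "interface", "computer"],
         ["graph", "trees", "system"],
         ["system", "human", "interface", "graph"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    passes=15,   # 10-20 passes instead of the default single pass
)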

Christopher S. Corley

Jan 4, 2016, 1:35:39 PM
to gensim
"Is there a large difference in the sizes of your corpus, in terms of number of documents, number of unique terms, or mean terms per document?" By this I mean after your preprocessing steps are completed. Just to make sure they aren't reducing the corpus too much. :-)

pawel.g.d...@gmail.com

Jan 4, 2016, 3:43:22 PM
to gensim
Thanks for the quick reply!

Well, there is a significant difference to begin with: the English Wikipedia dump is 12.5 GB, while the German one is 4 GB.
In any case, the German corpus (the TF-IDF MmCorpus) has ~1.65 million documents, 100,000 features, and ~245 million non-zero entries, while the English one has ~4 million documents, 100,000 features, and ~755 million non-zero entries. It seems reasonable that the EN corpus is two to three times the size of the DE one, given the size difference of the dumps.

When creating the models, I first ran the script provided by gensim for preprocessing the wiki dump (https://radimrehurek.com/gensim/wiki.html), then used multicore LDA to create the model from the resulting TF-IDF corpus. I probably used the default settings here, so just one pass. I will try increasing the number of passes for the DE model as you suggest and report back.
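
Roughly, the pipeline looks like this (a sketch only; the file names follow the tutorial's output naming, and num_topics/workers are placeholder values rather than the exact settings used here):

from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaMulticore

# Artifacts written by the wiki preprocessing script (names assumed).
id2word = Dictionary.load_from_text("dewiki_wordids.txt.bz2")
mm = MmCorpus("dewiki_tfidf.mm")
print(mm)  # sanity check: documents, features, non-zero entries

lda = LdaMulticore(
    corpus=mm,
    id2word=id2word,
    num_topics=100,
    passes=20,   # up from the default single pass
    workers=3,
)
lda.save("dewiki_lda.model")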

If you have any other suggestions (or parameters to tweak), I'd welcome them greatly! I'm pretty new to gensim.

Christopher S. Corley

Jan 4, 2016, 3:53:58 PM
to gensim
Hm! I'm not sure that's much of a problem then. How are you going about judging the quality of the models?

pawel.g.d...@gmail.com

Jan 4, 2016, 4:26:21 PM
to gensim
Alas, these are early experiments, so for now I am judging manually: I take some documents of varying length and feed them to the models.
The contents of the documents are known to me (for example, a software license agreement, a sports commentary, cooking recipes, etc.), so I know more or less what to expect of the resulting topic distributions. For instance, as a representative example, below is the English outcome for the GNU license text. This is nothing short of brilliant, and exactly what one might expect: a computer/software/technology topic, a law/legal topic, and a business/financial topic.
The corresponding result for the German version of the document (this is what I am actually trying to do: compare topic distributions for translations of the same document across languages) is not completely off, but it is definitely not spot on like the English one either. This has always been the case in my tests so far. I am aware that this is a purely subjective view, but that's all I can offer at this point. Anyway, it has led me to believe that I might be doing something wrong (or not doing something necessary) in the preprocessing of the German text, or that I might need to tweak the parameters of the German LDA model creation. I'm currently re-running the LDA creation with 20 passes.

Correlation: 0.521205828707
INFO : [(u'data', 0.026586245172387498), (u'information', 0.015914390641238663), (u'software', 0.012704520236886358), (u'code', 0.010014952166849207), (u'systems', 0.0083467418435341917), (u'source', 0.0069796402967319607), (u'web', 0.0069714605916698967), (u'project', 0.0066037502568369912), (u'open', 0.0056400656073773451), (u'database', 0.005543017976261967)]
INFO : Correlation: 0.247048787896
INFO : [(u'court', 0.022021126302133591), (u'law', 0.020566839303454931), (u'police', 0.016155609527520199), (u'act', 0.012006214393929831), (u'case', 0.0095495376930901047), (u'legal', 0.0067887621798527198), (u'justice', 0.006500539900767625), (u'said', 0.0058811460940946664), (u'prison', 0.005536762183323129), (u'judge', 0.0055237228208750355)]
INFO : Correlation: 0.11054561579
INFO : [(u'financial', 0.0075286310032007114), (u'economic', 0.0065923151961858502), (u'act', 0.0063692287271636845), (u'million', 0.0062029969301934012), (u'bank', 0.0054508460950598945), (u'tax', 0.0050785230719152565), (u'management', 0.0048578108151419016), (u'policy', 0.0046335396392108525), (u'fund', 0.0039710016698683778), (u'market', 0.003874307507573083)]
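
(For reference, scoring a single document against a trained model looks roughly like this; the file names and the plain simple_preprocess tokenization are placeholders, not the exact pipeline that produced the output above.)

from gensim.corpora import Dictionary
from gensim.models import LdaMulticore
from gensim.utils import simple_preprocess

id2word = Dictionary.load_from_text("enwiki_wordids.txt.bz2")
lda = LdaMulticore.load("enwiki_lda.model")

with open("gnu_license.txt") as f:
    bow = id2word.doc2bow(simple_preprocess(f.read()))

# Top 3 topics for this document, with their top words.
for topic_id, prob in sorted(lda[bow], key=lambda tp: -tp[1])[:3]:
    print(prob, lda.show_topic(topic_id, topn=10))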

pawel.g.d...@gmail.com

Jan 4, 2016, 4:28:06 PM
to gensim
I mean, "preprocessing of German Wikipedia", of course.

Christopher S. Corley

Jan 9, 2016, 2:08:13 PM
to gensim
Did you ever manage to figure out what was wrong with this? I'm not too familiar with building models for German, or with what good choices in corpus construction would be.

pawel.g.d...@gmail.com

Jan 10, 2016, 4:56:01 AM
to gensim
It seems the issue was in the preprocessing. Refining some of the filters, along with my stopword and goodword lists, did the trick! The option to install gensim in develop mode was very helpful here. Anyhow, the quality of the model now appears to be on par with the English one, so the German Wikipedia seems to be an appropriate corpus (to begin with, anyway).
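
The kind of filter refinement meant here is roughly the following (a sketch only; the thresholds, file names, and stopword handling are placeholder values, not the exact ones that did the trick):

from gensim.corpora import Dictionary
from nltk.corpus import stopwords

dictionary = Dictionary.load("dewiki.dict")  # placeholder path
german_stopwords = set(stopwords.words("german"))

# Keep mid-frequency terms: drop very rare and very common words.
dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000)

# Drop any stopwords that survived, then re-pack the ids.
bad_ids = [dictionary.token2id[w] for w in german_stopwords if w in dictionary.token2id]
dictionary.filter_tokens(bad_ids=bad_ids)
dictionary.compactify()
dictionary.save("dewiki_filtered.dict")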

Thanks for all the help and suggestions!