New enwiki-popular

timo.hors...@gmail.com

Dec 13, 2014, 9:49:35 AM
to aard...@googlegroups.com
Using the wikistats files [1] and some modifications to Igor's slob tool, I managed to compile a few enwiki-popular dictionaries from MHBraun's complete enwiki-20141201.
There are now dictionaries with different numbers of titles, ranging from 500 K to 2 M. The aim is to provide enwiki dictionaries even for devices with limited storage, such as the many smartphones that lack a micro-SD slot.
To make the dictionaries a bit more compact still, the references and infoboxes have been removed.

The files are located on my Mega account in the enwiki-popular folder:


The subfolder enwiki-popular/wikisort contains a simple IPython notebook that generates a sorted list of popular articles from the wikistats files, as well as the resulting word list (enwiki_sortet.txt) and my modified slob.py for [2].
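
For anyone who only wants the idea behind the sorting step, here is a minimal sketch of ranking titles by page views. It is not the notebook itself. It assumes the input lines follow the pagecounts-style layout "project title views bytes", and the names count_views and most_popular are made up for this illustration.

```python
from collections import Counter
from urllib.parse import unquote


def count_views(lines, project="en"):
    """Sum view counts per title from pagecounts-style lines."""
    views = Counter()
    for line in lines:
        parts = line.split()
        # Assumed layout: project, title, views, bytes. Skip anything else.
        if len(parts) != 4 or parts[0] != project:
            continue
        try:
            count = int(parts[2])
        except ValueError:
            continue
        # Titles are percent-encoded and use underscores for spaces.
        title = unquote(parts[1]).replace("_", " ")
        views[title] += count
    return views


def most_popular(views, limit):
    """Return up to `limit` titles, most viewed first, ties in title order."""
    ranked = sorted(views.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:limit]]


sample = [
    "en Main_Page 100 5000",
    "en Python_(programming_language) 40 9000",
    "de Hauptseite 70 3000",
    "en Main_Page 50 2500",
    "en Caf%C3%A9 40 1200",
]
top = most_popular(count_views(sample), limit=2)
print(top)
```

With real data, the same functions can be fed an open file instead of the sample list, for example one opened with gzip.open(path, "rt"), and limit would be set to the number of titles wanted in the dictionary.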

Cheers,
Timo

itkach

Dec 13, 2014, 6:39:46 PM
to aard...@googlegroups.com

Very useful, thank you.

If I may suggest - GitHub is a much better place for code than Mega. Anyone looking at https://github.com/itkach/slob would be able to discover your fork, and it is also much easier to see the differences. The same goes for IPython notebooks - they are awesome, but looking at (and running) a simple script in a version control repository is much easier.

mhbraun

Dec 13, 2014, 9:11:40 PM
to aard...@googlegroups.com
Great. Well done.