Making dictionaries from the official .zim HTML dumps


pko...@gmail.com

Nov 20, 2020, 6:28:16 AM
to aarddict

Hello, I am an Aard2 user with a little Python knowledge. I heard that making a .slob involves downloading all pages via the MediaWiki API and an intermediate CouchDB storage, and that this takes a lot of time. Why not just download the .zim package of the official HTML dumps [1] and repack it into a .slob with libzim [2]? This works on my Debian 10:

import libzim.reader  # python-libzim 0.x reader API
import slob

zim_file = libzim.reader.File("html_dump.zim")
print(zim_file.article_count)
with slob.create("output.slob") as slob_file:
    for i in range(zim_file.article_count):
        article = zim_file.get_article_by_id(i)
        if article.is_redirect:  # redirect entries carry no content
            continue
        # slob wants a bytes blob and a content type so Aard2 can render the page
        slob_file.add(bytes(article.content), article.title,
                      content_type=article.mimetype)
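
If the result should also look right in Aard2's dictionary list, the slob writer has a tag() method for header metadata; a minimal sketch (the tag names follow the usual slob conventions, and both values here are hypothetical examples):

import slob

with slob.create("output.slob") as slob_file:
    # 'label' is the name Aard2 shows in its dictionary list;
    # both values are placeholder examples
    slob_file.tag("label", "Wikipedia (de), repacked from zim")
    slob_file.tag("source", "https://de.wikipedia.org")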

franc

Nov 20, 2020, 10:00:13 AM
to aarddict
pko...@gmail.com wrote on Friday, November 20, 2020 at 12:28:16 PM UTC+1:

Hello, I am an Aard2 user with a little Python knowledge. I heard that making a .slob involves downloading all pages via the MediaWiki API and an intermediate CouchDB storage, and that this takes a lot of time. Why not just download the .zim package of the official HTML dumps [1] and repack it into a .slob with libzim [2]? This works on my Debian 10:

Sounds interesting!
I had a quick look at the equivalent German Wikipedia (full articles), which is quite up to date (from 11-2020): as a zim file it is 13 GB, but as a slob it is only about 5.4 GB!
So did you already try to convert a whole zim Wikipedia (without pictures, obviously)? Is it as small as the directly produced slob?

I tried zim readers and files in the past, but at that time the zim files were mostly old, so I spent days downloading my own wiki as a zim (with a tool called MWoffliner), and in the end the zim reader was not nearly as good as Aard2, so I stopped that.
But if these zim files are now regularly regenerated, it might be a faster way to get up-to-date slobs.



pk

Dec 31, 2020, 8:34:49 AM
to aard...@googlegroups.com
On Fri, Nov 20, 2020 at 4:00 PM franc <franc...@gmail.com>
wrote:
>
> pko...@gmail.com wrote on Friday, November 20, 2020 at 12:28:16 PM UTC+1:
>>
>> Hello, I am an Aard2 user with a little Python knowledge. I heard that making a .slob involves downloading all pages via the MediaWiki API and an intermediate CouchDB storage, and that this takes a lot of time. Why not just download the .zim package of the official HTML dumps [1] and repack it into a .slob with libzim [2]? This works on my Debian 10:
>
> Sounds interesting!
> I had a quick look at the equivalent German Wikipedia (full articles), which is quite up to date (from 11-2020): as a zim file it is 13 GB, but as a slob it is only about 5.4 GB!
> So did you already try to convert a whole zim Wikipedia (without pictures, obviously)? Is it as small as the directly produced slob?
Slob creation and compression depend only on the slob Python module
and are independent of zim or any other article source, since the
code above decompresses the articles in RAM anyway. I did not try it,
but if the output slob is not as small, then that is an issue/bug of
the slob Python module.
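
To make that concrete: the compression codec is fixed when the writer
is created, so the article source cannot influence it. A minimal
sketch, assuming slob.create() accepts the Writer's compression
keyword ('lzma2' is what Aard2 dictionaries typically use):

import slob

# Assumption: slob.create() forwards the Writer's compression keyword.
with slob.create("output.slob", compression="lzma2") as w:
    w.add(b"<html><body>test</body></html>", "Test",
          content_type="text/html; charset=utf-8")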

> I tried zim readers and files in the past, but at that time the zim files were mostly old, so I spent days downloading my own wiki as a zim (with a tool called MWoffliner), and in the end the zim reader was not nearly as good as Aard2, so I stopped that.
> But if these zim files are now regularly regenerated, it might be a faster way to get up-to-date slobs.
They have been generated monthly for a few years now. Wikimedia (the
non-profit behind Wikipedia) uses them to distribute Wikipedia in
developing countries.

franc

Dec 31, 2020, 9:06:03 AM
to aarddict
pko...@gmail.com wrote on Thursday, December 31, 2020 at 2:34:49 PM UTC+1:
... I did not try it, but if the
output slob is not as small, then that is an issue/bug of the slob
Python module. ...

Not as small? That must be a misunderstanding. The latest zim file for the complete German dewiki without pictures (all + nopic) is 13 GB.
The slob file for the same wiki is just 5.6 GB.
I cannot guarantee that the dewiki slob is really complete, but I have been using it for a while and quite often, and I have never noticed a missing or shortened article (as e.g. in the "nodet" zims).
So I don't know what causes the huge size difference between zim and slob, but I guess it is slob's far better compression.
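
One way to check that guess, assuming the slob reader exposes the header fields as properties (that is my reading of the slob module):

import slob

# "dewiki.slob" is a hypothetical file name
with slob.open("dewiki.slob") as s:
    # the slob header records the codec and the blob count
    print(s.compression, s.blob_count)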

The only advantage of zim wikis could be the images, but I tried them once and the images are so small that I did not find them helpful, certainly not at that huge zim file size. The all + maxi (50k images) dewiki zim is 38 GB!

So why don't you try to create a slob from a downloaded zim file now, as posted in the first post?
This seems very interesting!
