Wikipedia with pictures


mhb...@freenet.de

Nov 20, 2012, 3:21:08 PM
to aard...@googlegroups.com
Did anybody try to compile a Wikipedia with pictures into aarddict?
I know, it's gonna be huge, but if the pictures are resized, I guess it could be packed onto a 16 GB card.

itkach

Nov 21, 2012, 10:16:45 AM
to aard...@googlegroups.com
On Tuesday, November 20, 2012 3:21:08 PM UTC-5, mhbraun wrote:
Did anybody try to compile a Wikipedia with pictures into aarddict?  
I know, it's gonna be huge, but if the pictures are resized, i guess it could be packed on a 16GB card.

Short answer: no.

Long answer:

aarddict uses WebKit to display articles, so technically it is possible to include images. Math formulas, for example, are PNG images rendered during dictionary compilation and embedded into the article HTML using the data URI scheme (http://en.wikipedia.org/wiki/Data_URI_scheme). Other images could be included the same way, with no changes to the current dictionary format (only the dictionary compiler needs to change). This is not efficient storage-wise, but it is easy to implement.
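To illustrate the data URI approach, here is a minimal sketch in Python. It is not taken from the aarddict compiler; the function names are made up for this example. It also shows why the approach costs storage: base64 turns every 3 bytes into 4 characters.

```python
import base64
import html

def png_data_uri(png_bytes):
    """Return a data: URI that embeds the given PNG bytes."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return "data:image/png;base64," + encoded

def img_tag(png_bytes, alt=""):
    """Build an <img> element whose source is the embedded image."""
    return '<img alt="{}" src="{}">'.format(html.escape(alt), png_data_uri(png_bytes))

def encoded_length(raw_length):
    """Length of the base64 text for raw_length bytes (4 chars per 3 bytes)."""
    return 4 * ((raw_length + 2) // 3)
```

A 3000-byte image therefore takes 4000 characters inside the article HTML, before any compression the dictionary format applies.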

The difficult part (other than increasing dictionary size and processing time significantly) was that images were not available for download in a single data dump like article texts are, and one of the reasons for that was that many images (and other media files) didn't have a clear license or license that would allow distribution. 

Image dumps are now available (here, for example: http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/20121104/), so perhaps it's worth exploring, although I think expecting enwiki with images to fit on a 16 GB SD card is too optimistic; that is probably not going to happen.
 

mhbraun

Nov 25, 2012, 5:49:44 AM
to aard...@googlegroups.com
Thanks for the long answer and for the links.
 
 
Looks huge. Approx. 1.8 TB for local and external images.
I will download the first part of the local images (approx. 40 GB) and see whether resizing the images for small displays makes sense and how much space would be needed. It will take at least a week to download.
 
As I am expecting different picture formats, does aarddict have a preferred format? A conversion could be done on the fly as well.

mhbraun

Nov 25, 2012, 5:55:14 AM
to aard...@googlegroups.com

On Sunday, November 25, 2012 11:49:44 UTC+1, mhbraun wrote:
 
Looks huge. Approx 1,8 TB for local and external images.
I will download the first part of the local images (approx 40 GB) and will have a look if resizing the images for small displays makes sense and how much space would be needed. Will take at least a week to download.
 
 
Talking about dewiki, not enwiki ;-)

franc

Jan 22, 2013, 6:15:51 AM
to aard...@googlegroups.com
On Sunday, November 25, 2012 11:49:44 UTC+1, mhbraun wrote:
...

Looks huge. Approx 1,8 TB for local and external images.
I will download the first part of the local images (approx 40 GB) and will have a look if resizing the images for small displays makes sense and how much space would be needed. Will take at least a week to download.

Any news on this enterprise?
It would be great if it were possible with at least small thumbnails for the articles; they shouldn't need that much space, I guess/hope.

franc

itkach

Jan 22, 2013, 10:04:53 AM
to aard...@googlegroups.com
Offline Wikipedia with pictures exists in .zim format; take a look at http://kiwix.org. Unfortunately, no usable .zim readers exist for Android at the moment, as far as I know. I am considering adding .zim support to aarddict and eventually switching to .zim completely. This is still in the planning stage, though, and it's hard to tell when (and if) .zim support will be available in aarddict. Stay tuned.

franc

franc

Jan 22, 2013, 4:26:50 PM
to aard...@googlegroups.com
On Tuesday, January 22, 2013 16:04:53 UTC+1, itkach wrote:
... 
Offline Wikipedia with pictures exists in .zim format - take a look at  http://kiwix.org. Unfortunately no usable .zim readers exist for android at the moment, as far as I know. I consider adding .zim to aarddict and eventually switching to .zim completely. This is in the planning stage though, it's hard to tell at the moment when (and if) .zim support will be available in aarddict. Stay tuned. 

Okay, thank you. 18 GB for a wiki with pictures is a lot for a phone, though.
But I understand that it is actually a reasonable size; it could be bigger, I guess.
Anyway, this is not possible for me: I only have a 32 GB SD card and need much more than 14 GB for everything else, which is more important to me than the Wikipedia pictures. I will have to wait until I have a phone with a 64 GB SD card :)

frank

mhbraun

Jan 22, 2013, 5:37:00 PM
to aard...@googlegroups.com


Some new experience about this enterprize?
Would be great if it was possible with at least little thumbnails for the articles, they shouldn't need that much space I guess/hope.


I finally managed to download the 40 GB over my slow line last week and had a brief look at the contents of the package. The average size of the pictures is 230 kB. Reducing all files bigger than 20 kB to 20 kB (72 dpi, width of 480 pixels) would theoretically lead to an offline dewiki of less than 16 GB.

This needs further investigation:
- Will just the online pictures be good enough?
- Is there any way to select pictures to reduce the size of the final compilation?
- Is the guess of 20 kB OK for the display, or is there another limit?
- Are there any parameters that would shrink the vast number of pictures that are already under 20 kB?
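The effect of the 20 kB cap can be estimated with a small calculation. This is only a sketch: the function names and the sample file sizes are made up, and the bound uses decimal units (1 GB = 1,000,000 kB) together with the 40 GB and 230 kB figures from this post.

```python
def capped_total_kb(sizes_kb, cap_kb=20):
    """Total size if every file larger than cap_kb is reduced to cap_kb."""
    return sum(min(size, cap_kb) for size in sizes_kb)

def upper_bound_gb(total_gb, average_kb, cap_kb=20):
    """Upper bound on the capped total: file count times the cap.

    The file count is estimated as total size divided by average size.
    """
    file_count = total_gb * 1_000_000 / average_kb
    return file_count * cap_kb / 1_000_000

# Hypothetical file sizes in kB, for illustration only.
sample_sizes_kb = [5, 12, 230, 800]
sample_after_cap = capped_total_kb(sample_sizes_kb)

# The downloaded 40 GB part with an average picture size of 230 kB.
first_part_bound_gb = upper_bound_gb(40, 230)
```

For the downloaded part this gives a bound of roughly 3.5 GB. Pictures already under 20 kB make the real figure smaller, and the other parts of the dump are not covered by this estimate.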

Some more thinking is needed to arrive at a workable strategy. On the Palm platform there was a TomeRaider version of Wikipedia with pictures that fit into 3 GB. I have no clue how they did that.

   

itkach

Jan 22, 2013, 5:40:50 PM
to aard...@googlegroups.com

On Palm Platform there was a Tomeraider Version of Wikipedia with pictures which fitted into 3 GB. I have no clue how they did this one.
    

Since you refer to ancient history they probably didn't have to do anything clever because Wikipedia was young and tiny :) 

Frieder Ferlemann

Jan 22, 2013, 5:57:58 PM
to aard...@googlegroups.com
Hi,
Brainstorming:

Compression on a picture by picture basis might not be enough.

Maybe some free software exists which (pre)sorts wikipedia's
pictures by similarity and then uses one of the more
modern video compression algorithms?
(So adding yet another picture would "just" mean
having to deal with the delta to the best matching
existing one?
(and keeping an index and eventually sorting again?))

Greetings,
Frieder
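A much simpler stand-in for this idea can be tried with zlib: compress similar items in one stream, so that the second item is stored mostly as references to the first. This sketch does not use a video codec and does not sort by similarity; the two "pictures" are made-up byte strings. Real JPEG or PNG files are already compressed, so they would likely need to be decoded first for any such scheme to help.

```python
import random
import zlib

def separate_size(blobs):
    """Total size when every blob is compressed on its own."""
    return sum(len(zlib.compress(blob)) for blob in blobs)

def joint_size(blobs):
    """Size when the blobs are compressed back to back in one stream."""
    return len(zlib.compress(b"".join(blobs)))

# Made-up stand-ins for two similar pictures: random bytes plus a copy
# with a few bytes changed. Real image files would behave differently.
rng = random.Random(0)
first = bytes(rng.randrange(256) for _ in range(4000))
second = bytearray(first)
for position in (10, 2000, 3999):
    second[position] ^= 0xFF
blobs = [first, bytes(second)]
```

Comparing `joint_size(blobs)` with `separate_size(blobs)` shows the saving from sharing context between near-duplicates. Note that zlib only looks back 32 KB, so this toy version would not work for larger items.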

mhbraun

Jan 23, 2013, 12:41:43 PM
to aard...@googlegroups.com
Compression on a picture by picture basis might not be enough.

Maybe some free software exists which (pre)sorts wikipedia's
pictures by similarity and then uses one of the more
modern video compression algorithms?
(So adding yet another picture would "just" mean
having to deal with the delta to the best matching
existing one?
(and keeping an index and eventually sorting again?))
 
This is a nice approach. I will look into it. Of course, the viewer would need to support it as well.
 

itkach

Feb 7, 2013, 10:35:04 AM
to aard...@googlegroups.com
There's the Kiwix Wikipedia reader, which uses the ZIM file format; ZIM appears to have been adopted as MediaWiki's official format for offline content. Perhaps moving aarddict to the ZIM format is the way to go, although as far as I can tell there are some difficulties in making it perform well enough on mobile (specifically Android), where it's most needed.

mhbraun

Feb 8, 2013, 1:28:37 PM
to aard...@googlegroups.com

I did some tests and calculations.
Steps to be done:
- Changing the width of the pictures to 480 pixels
- Reducing the DPI to 72 DPI
- Reducing all images > 15 kB to 15 kB and leaving the rest as they are
This will result in reasonable thumbnails, and the dewiki should fit on a 16 GB SD card.

However, I am not able to download the picture dumps to give it a try because of my limited bandwidth...

For somebody who wants to give it a try:
I used XnView's batch mode for the tests; it runs reasonably fast. Important: you can keep the filenames if you write the compressed files to another directory.
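The decisions behind these steps can be sketched in a few lines of Python. This is not an XnView script, and the function names are made up; it only computes the target dimensions and decides which files need recompressing, while the actual pixel work is left to an image tool. The 72 DPI step is not modelled because, as far as I know, DPI is a metadata value and the pixel dimensions are what determine the file size.

```python
def target_dimensions(width, height, max_width=480):
    """Scale down to max_width keeping the aspect ratio; never upscale."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))

def needs_recompression(size_bytes, limit_bytes=15 * 1000):
    """True for files above the size limit; smaller files are left alone."""
    return size_bytes > limit_bytes
```

For example, a 1920x1080 picture would be scaled to 480x270, while a 320x240 picture would keep its dimensions.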

Shorty66

Feb 14, 2013, 9:35:46 AM
to aard...@googlegroups.com
I would love to be able to use the .zim format. Most smartphones are able to use SDXC cards if they are formatted as FAT32; don't believe the manufacturers regarding the maximum microSD capacity.

Regarding images: I think a combination of downsizing to typical smartphone resolutions (let's assume 800x480 here) and picking the pictures of most interest would be great. Isn't there a statistic for the most viewed Wikipedia pictures somewhere? If the mobile version had only the 20% most viewed pictures, that might be enough.
Also, you could limit the number of pictures per article or per paragraph and only use the first 6 or so.
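Both selection rules are easy to express once the data is available. The sketch below assumes a dictionary of view counts per picture and a dictionary of picture lists per article; whether such view statistics exist is exactly the open question above, so these inputs and the function names are hypothetical.

```python
def top_viewed(view_counts, percent=20):
    """Return the names of the most viewed pictures, keeping the top percent."""
    ranked = sorted(view_counts, key=view_counts.get, reverse=True)
    if not ranked:
        return set()
    keep = max(1, len(ranked) * percent // 100)
    return set(ranked[:keep])

def limit_per_article(article_images, max_images=6):
    """Keep only the first max_images pictures of every article."""
    return {title: images[:max_images] for title, images in article_images.items()}
```

The two filters can be combined: first cut each article down to its first six pictures, then drop any picture that is not in the top-viewed set.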

mhbraun

Feb 14, 2013, 11:06:20 AM
to aard...@googlegroups.com


On Thursday, February 14, 2013 15:35:46 UTC+1, Shorty66 wrote:
I would love to be able to use the .zim format. Most smartphones are able to use sdxc cards if they are formatted to fat32 - dont believe the manufacturers regarding maximal microsd-capacity.

I do not understand the connection between the .zim format and FAT32. Aarddict currently uses the .aar format, which works on FAT32, NTFS and ext4 as far as I know.
 
Regarding images: I think a combination of downsizing to typical smartphone resolutions (lets assume 800x480 here) and picking those pictures of most interest would be great. Isn`t there a statistic for the most viewed wikipedia pictures somewhere? If one could only have the 20% most viewed pictures in the mobile version that might be enough.
Also, you could limit the number of pictures per article or per paragraph to some number and only use the first 6 or so.

Well, the latter would of course be an option for me. However, you need to download the whole bunch of pictures in order to select from them. I did not find a subset. For dewiki this is 1.8 TB of data. With my limited bandwidth it would take months, and there is a new dump every month. The process just takes too long for me with the given resources.

Another option would be to integrate just the link of the picture to the online Wikipedia. If you really need a picture, you may have a look at it online.

 

Shorty66

Feb 15, 2013, 11:07:30 AM
to aard...@googlegroups.com
Would it be an option to create a crawler which automatically downloads only the needed images? Or is that against some wikipedia rules?
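The first step of such a crawler would be to find out which pictures an article actually references. A minimal sketch, assuming the article wikitext is already at hand: it looks for `[[File:...]]` links and the `Image:`, `Datei:` and `Bild:` aliases. The function name is made up, and pictures pulled in through templates, infoboxes or galleries would not be found this way. Whether crawling is allowed is not addressed here.

```python
import re

# Matches [[File:Name.jpg|...]] and the Image:/Datei:/Bild: aliases.
FILE_LINK = re.compile(
    r"\[\[\s*(?:File|Image|Datei|Bild)\s*:\s*([^|\]]+)",
    re.IGNORECASE,
)

def referenced_images(wikitext):
    """Return the file names referenced by file links in the wikitext."""
    return [name.strip() for name in FILE_LINK.findall(wikitext)]
```

The resulting list could then be used to fetch only those files instead of the whole dump.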

Regarding zim and FAT32: there is no correlation. I just wanted to point out that most modern smartphones are able to use big SDXC cards (64 GB and more) even if the manufacturer doesn't say so. The SD card just has to be formatted the right way to be usable in a smartphone, and Android smartphones will offer to format it automatically if it is formatted wrongly.
As 64 GB microSD cards are fairly cheap by now, I believe that many users might have enough storage space for Wikipedia with images in the future.

Shorty66

Feb 15, 2013, 11:26:13 AM
to aard...@googlegroups.com
I searched the web for an automatic image downloader and found this one, which downloads only the featured pictures. After resizing, that should be a fairly small amount of data and could be used for testing.

I also found this post which explains the use of "wikix" to download all the pictures. As wikix generates batch scripts for the downloads, these scripts might be easy to adjust in a way that only certain needed pictures are downloaded.

mhbraun

Feb 15, 2013, 12:53:35 PM
to aard...@googlegroups.com


On Friday, February 15, 2013 17:07:30 UTC+1, Shorty66 wrote:
Would it be an option to create a crawler which automatically downloads only the needed images? Or is that against some wikipedia rules?

I guess not. You may download the whole picture dump.
 
Regarding zim and fat 32: There is no correlation. I just wanted to state, that most modern smartphones are able to use big sdcx cards (64gb and more) even if the manufacturer doesnt say so. The SD-card just has to be formatted the right way to be able to use them in a smartphone, but android smartphones will offer to format them autmatically if they are formatted wrong.
As 64gb Microsd-cards are fairly cheap by now, i believe that many users might have enough storage space for the wikipedia with images in the future.

I own a Samsung i9001, which seems to be limited to 32 GB. How would you format the card so that the phone can make use of the 64 GB?
And of course you are right: with a 32 GB card, it should not be a problem at all to have all pictures in Wikipedia. That was my original intention. If I need a picture, it is often a flag of a country or one of those nice diagrams of computer connectors (see USB) or of electrical plugs in different countries. I do not care about most of the rest. Everybody has different interests. Why not pack all pictures into the offline Wikipedia?
My tests and calculations showed me that it should work for dewiki at least, under the given parameters (see above).

Are you able to download the English Wikipedia image dumps from http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/
and do a similar test? You're gonna need quite some download speed to get the data :-)

Shorty66

Feb 15, 2013, 1:37:53 PM
to aard...@googlegroups.com
I did a quick search on your i9001, and "the internet" says that it is able to use 64 GB cards.

The moment you insert the 64 GB SDXC card into your Samsung i9001, you will be asked if you want to format the card.
It will then proceed automatically, and the storage card should show up with its full capacity in the settings menu.

Be warned, though: Any data you had previously on the card will be lost.

Shorty66

Feb 15, 2013, 1:40:51 PM
to aard...@googlegroups.com
Regarding the tests: I have an 8000 kbps line over here. Downloading 1.8 TB would take me about 21 days.
I could try at my university, though. They have a 1 GB/s connection there.
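The 21-day figure can be checked with a short calculation. The sketch assumes decimal units (1 TB = 10^12 bytes, 1 kbps = 1000 bit/s) and, for the university line, assumes that "1 GB/s" means 1 Gbit/s; both are assumptions, and protocol overhead is ignored.

```python
def download_days(size_bytes, rate_bits_per_second):
    """Transfer time in days for the given size at a constant line rate."""
    seconds = size_bytes * 8 / rate_bits_per_second
    return seconds / 86400

DUMP_BYTES = 18 * 10**11  # 1.8 TB in decimal units

home_days = download_days(DUMP_BYTES, 8000 * 1000)    # 8000 kbps line
university_hours = download_days(DUMP_BYTES, 10**9) * 24  # assumed 1 Gbit/s
```

This gives about 20.8 days for the 8000 kbps line and, under the 1 Gbit/s assumption, about 4 hours at the university.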