Tech preview: new dictionary file format - slob


itkach

Jan 19, 2014, 8:34:02 PM
to aard...@googlegroups.com
I just pushed a first cut of a reference implementation for the new dictionary file format (called slob); check out the documentation and source code at https://github.com/itkach/slob. Dictionaries in the new format are quite a bit smaller than in the current Aard Dictionary format:

319 Mb enwiktionary-20110604.slob vs 910 Mb enwiktionary-20110604-2.html.aar
18 Mb wordnet-3.0.slob vs 41 Mb wordnet-3.0-1.aar
84 Mb simplewiki-20131030.slob vs 130 Mb simplewiki-20131030.aar
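For a sense of where the savings come from: the slob documentation describes content as being stored in compressed bins (LZMA2 by default), and compressing many similar articles together is far more effective than compressing each article on its own. A minimal stdlib sketch of that effect, with made-up sample articles; this is an illustration of the idea, not slob's actual layout:

```python
import lzma

# Hypothetical sample data: many small, similar HTML articles,
# roughly what a dictionary file holds.
articles = [
    ("word%d" % i).encode()
    + b"<html><body><p>definition of a word</p></body></html>"
    for i in range(1000)
]

# Compressing the articles together in one "bin" (as slob's docs describe)
# lets the compressor exploit redundancy across articles...
bin_blob = lzma.compress(b"".join(articles), preset=9)

# ...whereas compressing each article individually pays the container
# overhead per article and finds little redundancy within one.
individual = sum(len(lzma.compress(a, preset=9)) for a in articles)

raw = sum(len(a) for a in articles)
print("raw:", raw, "one bin:", len(bin_blob), "individual:", individual)
```

The one-bin result is a small fraction of both the raw size and the sum of individually compressed articles, which is consistent with the file-size comparison above.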

Slob is influenced by both the Aard Dictionary format and OpenZIM and can store any content, not just text or HTML. Be sure to also check out slobby, a web-based user interface for exploring the content of slob files; it's very bare-bones, but it does pretty much everything the aarddict desktop application does, and it can be accessed over the network.

Next I'll be looking into getting this deployed on Android and hooking it up with a mobile user interface.

Any feedback is much appreciated.

franc

Jan 20, 2014, 11:46:32 AM
to aard...@googlegroups.com
On Monday, January 20, 2014 at 02:34:02 UTC+1, itkach wrote:
I just pushed a first cut of a reference implementation for the new dictionary file format (called slob); check out the documentation and source code at https://github.com/itkach/slob. Dictionaries in the new format are quite a bit smaller than in the current Aard Dictionary format: ...

This is good news! Sounds great.
 
...

Slob is influenced by both Aard Dictionary format and OpenZIM and can store any content, not just text or html. 
...

Is there also a plan for downloading the actual Wikipedias?
At the moment, I am downloading the German Wikipedia to create a ZIM file for Kiwix using the undocumented mwoffliner.js, but I have to say that I am quite disappointed.
The download has already been running for two days and may take several more(!), so it is not comparable to the former, easy Aard dictionary creation from wiki dumps.
I would love to get an offline Wikipedia built from the mobile version of Wikipedia, but with Kiwix this is not yet possible, and I doubt that I will follow such a troublesome path to an offline wiki anyway.
Maybe with your approach this would be possible?
I am very curious how you will do it, and I offer my resources (a fast server with a fast internet connection: 100 Mbit/s) for compiling and hosting if necessary.
And my modest skills as well, if helpful :)
 
Next I'll be looking into getting this deployed on Android and hooking it up with a mobile user interface.


Sounds sensible to me, but I'm not familiar with it (Python for Android).

Thank you for this milestone!


itkach

Jan 20, 2014, 10:11:47 PM
to aard...@googlegroups.com


On Monday, January 20, 2014 11:46:32 AM UTC-5, franc wrote:
Is there also a plan for downloading the actual Wikipedias?
At the moment, I am downloading the German Wikipedia to create a ZIM file for Kiwix using the undocumented mwoffliner.js, but I have to say that I am quite disappointed.
The download has already been running for two days and may take several more(!), so it is not comparable to the former, easy Aard dictionary creation from wiki dumps.
I would love to get an offline Wikipedia built from the mobile version of Wikipedia, but with Kiwix this is not yet possible, and I doubt that I will follow such a troublesome path to an offline wiki anyway.
Maybe with your approach this would be possible?

Yes, I plan to write a Wikipedia downloader that would get rendered HTML (and possibly images, at least some of them) from Wikipedia itself via their web API. Such a downloader is going to be necessarily slow, but the downloaded documents can be stored in a document database such as CouchDB (which can easily be shared and/or replicated) and updated only when they are actually modified. So the initial download would take a while, but keeping it up to date would (hopefully) be relatively fast. Packing documents from a database like CouchDB into slob should be fairly straightforward.
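The "updated only when they are really modified" part can be sketched with revision identifiers: MediaWiki's web API reports a revision id per page, so a sync pass only needs to re-download pages whose id changed since the last run (a CouchDB document's `_rev` plays a similar role on the storage side). A minimal sketch; the function name and sample data are hypothetical:

```python
def pages_to_update(local_revs, remote_revs):
    """Given {title: revid} maps for the local store and for the wiki,
    return the titles whose stored copy is stale or missing.
    Sketch only -- a real sync would fetch remote revids via the web API."""
    return sorted(
        title
        for title, revid in remote_revs.items()
        if local_revs.get(title) != revid
    )

local = {"Apple": 101, "Banana": 202, "Cherry": 300}
remote = {"Apple": 101, "Banana": 205, "Date": 400}  # Banana edited, Date new

print(pages_to_update(local, remote))  # → ['Banana', 'Date']
```

Only the two changed or new pages get re-downloaded, which is why the second pass should be so much faster than the initial one.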
  
Next I'll be looking into getting this deployed on Android and hooking it up with a mobile user interface.


Sounds sensible to me, but I'm not familiar with it (Python for Android).

The hope here is to run on Android the exact same code that works on desktop, eliminating the need for separate code bases (for the most part; there's still going to be some Android-specific stuff).


franc

Jan 22, 2014, 1:08:06 AM
to aard...@googlegroups.com
On Tuesday, January 21, 2014 at 04:11:47 UTC+1, itkach wrote:
...
Yes, I plan to write a Wikipedia downloader that would get rendered HTML (and possibly images, at least some of them) from Wikipedia itself via their web API. Such a downloader is going to be necessarily slow, but the downloaded documents can be stored in a document database such as CouchDB (which can easily be shared and/or replicated) and updated only when they are actually modified. So the initial download would take a while, but keeping it up to date would (hopefully) be relatively fast...

This is a difficult thing, I guess.
But if it is really possible to download only the changes, I guess the second download (a month later) could be done in a few hours.
I'm looking forward to it :)

Shorty66

Mar 17, 2014, 4:21:34 AM
to aard...@googlegroups.com
Will slob support incremental updates to slob files, or is this only planned for the raw data in CouchDB?
I think incremental updates are very important for users if large dictionaries (like a wiki + images) are used.

It would also be nice to be able to exchange pictures in a slob. That way, a slob with very low-resolution thumbnails could be published, and the slob reader could then download higher-resolution pictures and integrate them into the slob on demand.

Shorty66

Mar 17, 2014, 4:37:08 AM
to aard...@googlegroups.com
This would also allow you to publish wikis with placeholders instead of images. These placeholders could then be replaced later by the user. I imagine that the images could be stored in an Adam7-interlaced format, and the application would stop downloading each image once a certain resolution has been reached.

The application could then have an option to choose the maximum resolution and download the images. As the downloaded wikis would only contain placeholders, they would be similar in size to wikis without images.

itkach

Mar 17, 2014, 10:23:13 AM
to aard...@googlegroups.com
On Monday, March 17, 2014 4:21:34 AM UTC-4, Shorty66 wrote:
Will slob support incremental updates to slob files, or is this only planned for the raw data in CouchDB?
I think incremental updates are very important for users if large dictionaries (like a wiki + images) are used.
 
I don't have such plans. Slob is a read-only data store, not a database. Writable stores are a lot more complex and often require some sort of maintenance ("compaction" or "vacuuming") to optimize space usage. Also, with the number of edits Wikipedia gets, it is not clear that incremental updates would result in significant download-time savings. The only reason I can think of why this might be perceived as important is that dictionary downloads are currently not very reliable. As I said in another thread, maybe it's a good time to give torrents another try.

It would also be nice to be able to exchange pictures in a slob. That way, a slob with very low-resolution thumbnails could be published, and the slob reader could then download higher-resolution pictures and integrate them into the slob on demand.

Perhaps. But then, if you are online, you can just view the article online.

itkach

Mar 17, 2014, 10:30:11 AM
to aard...@googlegroups.com


On Monday, March 17, 2014 4:37:08 AM UTC-4, Shorty66 wrote:
This would also allow you to publish wikis with placeholders instead of images. These placeholders could then be replaced later by the user. I imagine that the images could be stored in an Adam7-interlaced format, and the application would stop downloading each image once a certain resolution has been reached.

The application could then have an option to choose the maximum resolution and download the images. As the downloaded wikis would only contain placeholders, they would be similar in size to wikis without images.

Image elements generated by MediaWiki include image-set (srcset) attributes that let the browser select the best image resolution from the set for a particular device. Leaving image URLs pointing to the Wikipedia sites, or using an offline thumbnail plus online image-set URLs, will probably achieve a good compromise between image availability and dictionary size. I'm not at the point yet where I can experiment with this, but it's certainly something to look into.
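As a rough illustration of the thumbnail-plus-online-srcset compromise: the sketch below points an image's src at a bundled offline thumbnail while leaving the srcset URLs on Wikipedia's servers, so an online browser can still pick up a higher-resolution version. The helper name and local path are hypothetical, and real MediaWiki markup carries more attributes than this stripped-down tag:

```python
import re

def localize_thumb(img_tag, local_path):
    """Replace the first src="..." attribute with a local (offline) path,
    leaving srcset untouched. Sketch only; a real converter would use an
    HTML parser rather than a regex."""
    return re.sub(r'src="[^"]*"', 'src="%s"' % local_path, img_tag, count=1)

# Simplified example of a MediaWiki-style responsive image tag.
tag = ('<img src="//upload.wikimedia.org/thumb/Cat.jpg/220px-Cat.jpg" '
       'srcset="//upload.wikimedia.org/thumb/Cat.jpg/330px-Cat.jpg 1.5x, '
       '//upload.wikimedia.org/thumb/Cat.jpg/440px-Cat.jpg 2x">')

print(localize_thumb(tag, "img/Cat-thumb.jpg"))
```

Offline, the reader renders the bundled thumbnail; online, a browser honoring srcset can fetch the 1.5x or 2x variant from Wikipedia.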

Shorty66

Mar 18, 2014, 6:20:23 AM
to aard...@googlegroups.com
Okay, I understand your objections regarding incremental updates.
If images are stored in multiple resolutions and only one requested version is downloaded anyway, it would be nice to be able to set the resolution in the options.
Something like "thumbs only" or a maximum-resolution slider.
If the images were not stored directly in the .slob file, it would be easier to distribute the images and the slobs separately.

I think torrents are a pretty good option for distribution. You could also think about WebRTC for in-app distribution. That way, every user of the application would have the option to seed the wiki files directly from the app. To save costs, this would only be allowed on Wi-Fi.
There are some WebRTC projects that replace content delivery networks with p2p exchanges.

See https://peercdn.com/