Wikipedia on FBReader - possible?

Chris K

Jan 22, 2009, 12:26:27 PM
to FBReader
Hello there.

I've downloaded abridged copies of Wikipedia in TomeRaider format to
browse on my laptop or PDA, which is a really cool thing. In the
eeePC community there seems to be some interest in an equivalent
version of Wikipedia in a format that would be accessible on a Linux
machine, but there don't seem to be any TomeRaider clients for Linux :(

I've found a tutorial showing the original Wikipedia XML-to-TR
conversion process at http://infodisiac.com/Wikipedia/ProcedureTR3.html,
and it seems like there's nothing really special about TomeRaider
there; it's just a format that can handle a huge batch of indexed and
cross-referenced HTML, with compression. So, with a few tweaks, could
this same approach generate a version of Wikipedia that could be
easily read with FBReader? With any decent level of compression? What
formats might be the best ones to try?

Thanks in advance for considering my question.

sking

Jan 22, 2009, 4:17:42 PM
to FBReader
Not only would it be possible, it wouldn't even be terribly
complicated. The page at
<http://en.wikipedia.org/wiki/Wikipedia:Database_download> explains
how to download the entire database. I would go for the
"pages-articles.xml" dump, which is the text of the articles without
the talk or user pages. The current version is 4.1 GB, compressed.
You may want to use just a subset of that... Within the XML the
articles are stored as plain-text Wiki markup. It's a fairly simple
matter to parse that and convert it to HTML or some other format.
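
Just to make that concrete, here's a rough sketch of how a streaming
parse of the dump might look in Python. The dump file name, the export
namespace version, and the regex-level markup handling are all
assumptions for illustration, not a finished converter:

    import bz2
    import re
    import xml.etree.ElementTree as ET

    # The namespace version changes between dumps; check the file header.
    NS = "{http://www.mediawiki.org/xml/export-0.3/}"

    def wiki_to_html(text):
        # Deliberately crude: only == Headings == and [[internal links]].
        text = re.sub(r"^==+ *(.*?) *==+$", r"<h2>\1</h2>", text, flags=re.M)
        text = re.sub(r"\[\[([^|\]]+)\|([^\]]+)\]\]", r'<a href="\1">\2</a>', text)
        text = re.sub(r"\[\[([^\]]+)\]\]", r'<a href="\1">\1</a>', text)
        return text

    with bz2.open("enwiki-pages-articles.xml.bz2", "rb") as dump:
        for event, elem in ET.iterparse(dump):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                body = elem.findtext(NS + "revision/" + NS + "text") or ""
                html = wiki_to_html(body)
                # ... write html somewhere, one file or chunk per article ...
                elem.clear()  # drop the parsed page to keep memory use flat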

Ideally I'd convert this to Open eBook format
<http://www.openebook.org/>. FBReader's support for that format isn't
very extensive, though. For use specifically with FBReader you'd be
better off converting to its native format, FictionBook
<http://www.fictionbook.org/index.php/Eng:FictionBook_description>.
Then it's just a matter of converting the Wiki syntax for headings,
links, and other display elements into their FictionBook equivalents.
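
As a toy illustration of that mapping, here's a minimal sketch that
turns one already-split article into a FictionBook <section>. The
element names (section, title, p, and <a l:href="..."> for links, with
the l: prefix bound to the XLink namespace in the document root) come
from the FB2 spec; the helper function and the sample data are made up:

    from xml.sax.saxutils import escape, quoteattr

    def article_to_fb2_section(title, paragraphs):
        # One wiki article becomes one <section> with a <title>.
        out = ["<section id=%s>" % quoteattr(title.replace(" ", "_"))]
        out.append("<title><p>%s</p></title>" % escape(title))
        for para in paragraphs:
            out.append("<p>%s</p>" % escape(para))
        out.append("</section>")
        return "\n".join(out)

    print(article_to_fb2_section("Aardvark",
        ["The aardvark is a burrowing, nocturnal mammal native to Africa."]))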

So the theory is fairly simple, but you'd better start with a really
good grasp of Wiki markup and of the FictionBook specs. Then it's just
a matter of writing the converter and waiting for it to grind through
the entire data set. I'd expect the final compressed size to be
roughly on par with the compressed size of the original data.

Now, whether or not FBReader can handle a file that large without
completely choking on it, *that's* another question!

AlanW

Jan 23, 2009, 3:07:03 AM
to FBReader
FBReader currently reads the entire ebook into memory, so a
single-file approach won't work.

An approach that might work would be to convert each article into a
MOBI (or FB2 or ePub) and then have an index document with links to
each article. The links would be external to the ebook and so would
invoke FBReader's new download capability. The files could actually
be remote, or they could be local if the download interface handles
this (or could be made to handle this).
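
Here's a minimal sketch of what such an index might contain, assuming
one FB2 file per article sitting under some base URL. The URL and file
layout are invented, and whether the download interface would resolve
them for local files is exactly the open question above:

    from xml.sax.saxutils import escape, quoteattr

    # Hypothetical location of the per-article files.
    BASE = "http://example.org/wiki-fb2/"

    def index_entries(titles):
        # One <p> per article; each link points at an external file
        # rather than at a section inside this book.
        lines = []
        for title in titles:
            href = BASE + title.replace(" ", "_") + ".fb2.zip"
            lines.append("<p><a l:href=%s>%s</a></p>"
                         % (quoteattr(href), escape(title)))
        return "\n".join(lines)

    print(index_entries(["Aardvark", "Abacus", "Linux"]))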

However, Wikipedia is web-based, and the simplest way to get the
articles locally on a Linux device is to use a web browser and a local
copy of the wiki. This has actually been done for the iRex iLiad,
which is a very resource-poor device. If it can work on the iLiad it
can definitely work on any Linux desktop. See
http://code.google.com/p/pyoffwiki/