On Jan 22, 12:26 pm, Chris K <chrisk...@gmail.com> wrote:
> Hello there.
>
> I've downloaded abridged copies of wikipedia in tomeraider format to
> browse on my laptop or PDA, which is a really cool thing. From the
> eeePC community there seems to be some interest in an equivalent
> version of wikipedia in a format that could be accessible on a linux
> machine, but there don't seem to be tomeraider clients for linux :(
>
> I've found a tutorial showing the original wikipedia XML to TR
> conversion process at http://infodisiac.com/Wikipedia/ProcedureTR3.html,
> and it seems like there's nothing really special about tomeraider
> there, it's just a format that can handle a huge batch of indexed and
> cross-referenced HTML, with compression. So - with a few tweaks, could
> this same approach generate a version of wikipedia that could be
> easily read with fbreader? With any decent level of compression? What
> formats might be the best ones to try?
>
> Thanks in advance for considering my question.
Not only would it be possible, it wouldn't even be terribly
complicated. The page at
<http://en.wikipedia.org/wiki/Wikipedia:Database_download> tells how to
download the entire
database. I would go for the "pages-articles.xml" format, which is
the text of the articles without the talk or user pages. The current
version is 4.1GB, compressed. You may want to use just a subset of
that... Within the XML the articles are stored as plain-text Wiki
markup. It's a fairly simple matter to parse that and convert it to
HTML or some other format.
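
For example, here's a minimal sketch (in Python, my choice; the file
path and function name are just placeholders) of how that parsing step
could stream a decompressed pages-articles.xml dump without ever
loading the whole thing into memory:

import xml.etree.ElementTree as ET

def iter_articles(path):
    """Yield (title, wikitext) pairs from a MediaWiki XML dump,
    streaming so the multi-gigabyte file never sits in memory."""
    title, text = None, None
    for event, elem in ET.iterparse(path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # ignore the export-schema namespace
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()                    # free the parsed subtree

if __name__ == "__main__":
    for title, wikitext in iter_articles("pages-articles.xml"):
        print(title, len(wikitext))
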
Ideally I'd convert this to Open eBook format
<http://www.openebook.org/>. FBreader's support for that format isn't
very extensive, though. For use specifically with FBreader you'd be
better off converting to FBreader's native format, FictionBook
<http://www.fictionbook.org/index.php/Eng:FictionBook_description>.
Then it's
just a matter of converting the Wiki syntax for headings, links, and
other display elements into their FictionBook equivalents.
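
As a rough illustration of that translation step (the rules and tag
names here follow my reading of the FictionBook description linked
above; a real converter needs a proper wiki parser to cope with
templates, tables, and nested lists):

import re

RULES = [
    (re.compile(r"^==+\s*(.*?)\s*==+\s*$", re.M),   # == Heading ==
     r"<subtitle>\1</subtitle>"),
    (re.compile(r"'''(.+?)'''"),                    # '''bold'''
     r"<strong>\1</strong>"),
    (re.compile(r"''(.+?)''"),                      # ''italic''
     r"<emphasis>\1</emphasis>"),
    (re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]"),    # [[Target|label]]
     r'<a l:href="#\1">\2</a>'),
    (re.compile(r"\[\[([^\]]+)\]\]"),               # [[Target]]
     r'<a l:href="#\1">\1</a>'),
]

def wiki_to_fb2(wikitext):
    """Translate a handful of common wiki constructs into
    FictionBook-style elements; everything else passes through."""
    for pattern, replacement in RULES:
        wikitext = pattern.sub(replacement, wikitext)
    parts = []
    for block in wikitext.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        # Leave already-converted elements alone, wrap plain text in <p>.
        parts.append(block if block.startswith("<") else "<p>%s</p>" % block)
    return "\n".join(parts)
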
So the theory is fairly simple, but you'd better start with a really
good grasp of Wiki markup and of the FictionBook specs. Then it's just
a matter of writing the converter and waiting for it to grind through
the entire data set. I'd expect the final compressed size to be
roughly on par with the compressed size of the original data.
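
To give an idea of what the converter's driver loop might look like,
here's a sketch using the two hypothetical helpers above; the
FictionBook header is trimmed to the bare minimum, and a real file
needs the full <description> block the spec requires:

import zipfile
from xml.sax.saxutils import escape

def build_fb2(dump_path, out_path="wikipedia.fb2"):
    with open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                  '<FictionBook xmlns:l="http://www.w3.org/1999/xlink">\n'
                  '<body>\n')
        for title, wikitext in iter_articles(dump_path):
            out.write("<section><title><p>%s</p></title>\n" % escape(title))
            # Real code would also escape &, <, > in the article text itself.
            out.write(wiki_to_fb2(wikitext))
            out.write("\n</section>\n")
        out.write("</body>\n</FictionBook>\n")
    # FBreader can, as far as I know, open zipped FictionBook files, which
    # should bring the size back down toward that of the compressed dump.
    with zipfile.ZipFile(out_path + ".zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(out_path)
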
Now, whether or not FBreader can handle a file that large without
completely choking on it, *that's* another question!