Importing Wikipedia Dumps


Richard Smith

Jan 7, 2015, 1:07:58 AM
to tiddl...@googlegroups.com
Hi all. I'm wondering whether anybody has tried importing (largish amounts of) Wikipedia data into TiddlyWiki?

I can use BJ's excellent TiddlyClip to import individual pages, but I wonder if there's a way to get larger chunks of Wikipedia?

It's possible to download offline dumps of various Wikipedia projects (http://en.wikipedia.org/wiki/Wikipedia:Database_download), but I'm not sure what the best format would be for getting the stuff into TW in a nice clean way. Any ideas?

Regards,
Richard

Richard Smith

Jan 7, 2015, 1:14:20 AM
to tiddl...@googlegroups.com
OK. That was a bit lazy. I searched after I posted and found a similar recent thread. https://groups.google.com/forum/#!searchin/tiddlywiki/wikipedia/tiddlywiki/BefZrA4BpqQ/-XLsXOaav5wJ

I'll contextualise my question a little better.

I have recently been corresponding with someone who works for the Wikipedia "Offline Content Generator" project (http://www.mediawiki.org/wiki/Offline_content_generator) and I want to ask him if it's possible to add a widget/filter (?) to get content in a TiddlyWiki-friendly format. What should I ask for? :)

Regards,
Richard

PMario

Jan 7, 2015, 3:37:06 AM
to tiddl...@googlegroups.com
It is possible to download Wikipedia databases [1] in XML format, but they need some post-processing before they can be used with TW. ...
BUT the problem here is size:
These files expand to multiple terabytes of text. Please only download these if you know you can cope with this quantity of data. Go to Latest Dumps and look out for all the files that have 'pages-meta-history' in their name.
So IMO this is a no go!

-----------

I think the most promising way is the Wikipedia API: http://www.mediawiki.org/wiki/API
or export: http://en.wikipedia.org/wiki/Special:Export
or http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot#APIs_for_bots

Special:Export seems to use XML only, so IMO the MediaWiki API and the APIs for bots are the options here.

If Special:Export could create CSV or JSON, it could be used directly by TW, with drag-and-drop import. ... but ...

There is still a syntax problem. MediaWiki syntax is completely different from TW syntax. ... With an export/import mechanism, you'll also need to export/import the "meta structure", e.g. tags, fields and maybe relations ...

So IMO it can be done for a limited amount of data, but there is some work that needs to be done.
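
To give a feel for that work, here is a minimal Python sketch that turns Special:Export-style XML into a list of tiddler dictionaries. It assumes the export file contains <page> elements with a <title> and a <revision>/<text> inside them; the function names, the SAMPLE string and the "wikipedia" tag are made up for illustration, and the page text is left as MediaWiki markup.

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def export_xml_to_tiddlers(xml_text):
    """Turn each <page> of an export file into a dict with title, text and tags."""
    tiddlers = []
    root = ET.fromstring(xml_text)
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title, text = None, ""
        for node in page.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                # If a page carries several revisions, the last one wins.
                text = node.text or ""
        if title:
            # Hypothetical tag; the text is still MediaWiki markup at this point.
            tiddlers.append({"title": title, "text": text, "tags": "wikipedia"})
    return tiddlers


# Simplified, hand-written sample in the assumed export layout.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>TiddlyWiki</title>
    <revision>
      <text>'''TiddlyWiki''' is a [[wiki]].</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    print(json.dumps(export_xml_to_tiddlers(SAMPLE), indent=4))
```

The printed JSON is a list of objects with "title", "text" and "tags" keys, which could then be saved to a file for import.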

---------------

The TW JSON format for 2 tiddlers would look like this:

[
    {
        "created": "20150107082527588",
        "text": "some text for tiddler 1 with an internal link to [[tiddler 2]]",
        "title": "tiddler 1",
        "tags": "tag1 tag2",
        "modified": "20150107082619778",
        "field1": "some text for field1"
    },
    {
        "created": "20150107082624952",
        "text": "some text for tiddler 2 with an internal link to [[tiddler 1]]",
        "title": "tiddler 2",
        "tags": "tag1 tag2",
        "modified": "20150107082705968",
        "field1": "some more text for field1",
        "field2": "text for field2"
    }
]

A file that contains this text can be imported directly into TW out of the box.
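
A file like this could also be generated by a script. The following Python sketch builds tiddler dictionaries in the shape shown above; make_tiddler and tw_timestamp are hypothetical helper names, and the timestamp layout (17 digits, YYYYMMDDHHMMSSmmm in UTC) is inferred from the example values. It also assumes that tags containing spaces are wrapped in [[double square brackets]].

```python
import json
from datetime import datetime, timezone


def tw_timestamp(moment):
    """Format a timezone-aware datetime as YYYYMMDDHHMMSSmmm in UTC."""
    moment = moment.astimezone(timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S") + f"{moment.microsecond // 1000:03d}"


def make_tiddler(title, text, tags=(), when=None, **fields):
    """Build one tiddler dict; extra keyword arguments become custom fields."""
    when = when or datetime.now(timezone.utc)
    stamp = tw_timestamp(when)
    tiddler = {"title": title, "text": text, "created": stamp, "modified": stamp}
    if tags:
        # Tags are stored as one space-separated string.
        tiddler["tags"] = " ".join(f"[[{t}]]" if " " in t else t for t in tags)
    tiddler.update(fields)
    return tiddler


if __name__ == "__main__":
    tiddlers = [
        make_tiddler("tiddler 1", "links to [[tiddler 2]]", tags=["tag1", "tag2"]),
        make_tiddler("tiddler 2", "links to [[tiddler 1]]", tags=["tag1"], field2="text for field2"),
    ]
    with open("tiddlers.json", "w", encoding="utf-8") as handle:
        json.dump(tiddlers, handle, indent=4)
```

Because fields are just dictionary keys, each tiddler can carry a different set of them.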

----------

CSV format looks like this:

"title","text","modified","created","field1","field2","tags"
"tiddler 1","some text for tiddler 1 with an internal link to [[tiddler 2]]","20150107082619778","20150107082527588","some text for field1","","tag1 tag2"
"tiddler 2","some text for tiddler 2 with an internal link to [[tiddler 1]]","20150107082705968","20150107082624952","some more text for field1","text for field2","tag1 tag2"

I don't know how to import this file.
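
One possible workaround is to convert the CSV file to the JSON format first. This is only a sketch using Python's standard csv module; csv_to_tiddlers and SAMPLE are illustrative names, and empty cells are dropped so that a tiddler only gets the fields it actually uses.

```python
import csv
import io
import json


def csv_to_tiddlers(csv_text):
    """Read CSV with a header row and return one dict per row, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    tiddlers = []
    for row in reader:
        tiddlers.append({name: value for name, value in row.items() if value})
    return tiddlers


# Shortened version of the CSV example from this post.
SAMPLE = '''"title","text","field1","field2","tags"
"tiddler 1","links to [[tiddler 2]]","some text for field1","","tag1 tag2"
"tiddler 2","links to [[tiddler 1]]","some more text for field1","text for field2","tag1 tag2"
'''

if __name__ == "__main__":
    print(json.dumps(csv_to_tiddlers(SAMPLE), indent=4))
```

The output has the same list-of-objects shape as the JSON example, so it could be saved as a .json file and imported from there.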

-------------

Important:
 - Tiddler fields are dynamic.
 - The number of tags is dynamic.
 - TW wiki syntax is completely different, so some conversion would need to be done.
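
To illustrate the kind of syntax conversion meant here, this Python sketch rewrites a few common MediaWiki constructs into TW5 wikitext: headings, piped links, bold and italic. It is deliberately tiny; mediawiki_to_tw5 is a hypothetical name, and templates, tables, references and nested formatting are not handled at all.

```python
import re

BOLD_MARK = "\x00"  # temporary placeholder so bold is not mistaken for italic


def mediawiki_to_tw5(text):
    """Convert a small subset of MediaWiki markup to TW5 wikitext."""
    # Headings: "== Title ==" becomes "!! Title" (one "!" per "=").
    text = re.sub(
        r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$",
        lambda m: "!" * len(m.group(1)) + " " + m.group(2),
        text,
        flags=re.MULTILINE,
    )
    # Piped links swap their parts: [[Target|Label]] becomes [[Label|Target]].
    text = re.sub(r"\[\[([^\]|]+)\|([^\]|]+)\]\]", r"[[\2|\1]]", text)
    # Bold ('''x''') is handled before italic (''x'') because it contains two quotes too.
    text = re.sub(r"'''(.+?)'''", lambda m: BOLD_MARK + m.group(1) + BOLD_MARK, text)
    text = re.sub(r"''(.+?)''", r"//\1//", text)
    # TW5 writes bold with two single quotes.
    return text.replace(BOLD_MARK, "''")


if __name__ == "__main__":
    source = "== History ==\n'''TiddlyWiki''' is an ''open-source'' [[wiki software|wiki]]."
    print(mediawiki_to_tw5(source))
```

Plain links such as [[HTML]] are written the same way in both syntaxes, so they are left untouched.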

I hope that helps.

have fun!
mario

[1] http://en.m.wikipedia.org/wiki/Wikipedia:Database_download

PMario

Jan 7, 2015, 3:44:40 AM
to tiddl...@googlegroups.com
I can't upload tiddlers.csv :/
-m
tiddlers.json

PMario

Jan 7, 2015, 3:45:25 AM
to tiddl...@googlegroups.com
tiddlers.csv

Tobias Beer

Jan 7, 2015, 4:18:17 AM
to tiddl...@googlegroups.com
Hi Richard,
 
importing (largish amounts of) wikipedia data into TiddlyWiki

The first response that pops off my synapses reading such a proposal is:
Why? and again: Why?

As Mario hints, you will need...
  • field mapping
    • for importing
  • content conversion
    • formatting adaptors in TW5 to show MediaWiki-style markup
I don't think there's anything within TW5 for that yet, and...
the idea doesn't sound utterly compelling to me.

A Wikipedia import for individual pages, ok, but large dumps?

...if it's possible to add a widget/filter (?) 
to get content in a TiddlyWiki-friendly format. 
What should I ask for? :)

It would be terrific if someone created such a thing, but you gotta wonder though:
Why would they ever want to invest so much effort into that?

So, what's the actual incentive / goal you have in mind other than
"because it would be cool" for TiddlyWiki?

What do you want to do with all this "stuff" in TiddlyWiki?

Best wishes, Tobias.

Andreas Hahn

Jan 7, 2015, 11:28:56 AM
to tiddl...@googlegroups.com
On 07.01.2015 at 10:18, Tobias Beer wrote:
So, what's the actual incentive / goal you have in mind other than
"because it would be cool" for TiddlyWiki?

What do you want to do with all this "stuff" in TiddlyWiki?

Well, I can think of several reasons:

- To take ownership of the information you need. (i.e. have them on YOUR computer where they belong to YOU, this is a core idea behind TW)
- To have them readily available in the format you work with.
- To fully incorporate a specific piece of information in a specific revision into your wiki without relying on external sources (which may and will change over time).
- To search them << obviously

For most TW users, it will be obvious that the amount of data we are talking about here makes it infeasible to actually "import" the data into a TiddlyWiki. Instead, I imagine that after you have converted the data into a TW-like format, you would need:

- A suitable storage WITH a TiddlyWeb interface on top of it (this is comparatively easy actually).
- A suitable lazy loading mechanism within your client TW.
- A proper search mechanism, since the default TW one will not do the job.

/Andreas

Tobias Beer

Jan 7, 2015, 2:08:53 PM
to tiddl...@googlegroups.com
For a simple offline Wikipedia, perhaps use:


To take ownership of the information you need. (i.e. have them on YOUR computer where they belong to YOU, this is a core idea behind TW)

I wouldn't think of it as ownership. More like a copy, perhaps an offline backup.

To have them readily available in the format you work with.

Kiwix will give you that, I think... there is also an Android app if you want it.
 
To fully incorporate a specific piece of information in a specific revision into your wiki without relying on external sources (which may and will change over time).

I believe you can access and refer to specific revisions on Wikipedia; you don't need to point to the latest revision or make a copy of it.

To search them

Wikipedia is quite good at that. Of course, always on the latest content, which makes sense to me. Kiwix works too.

For most TW users, it will be obvious that the amount of data we are talking about here makes it infeasible to actually "import" the data into a TiddlyWiki.

For everyone, really. A standalone TiddlyWiki is clearly not designed for that.
 
Instead I imagine that, after you converted the data into a TW-like format...

OK, so that's the thing Richard is proposing... some (command-line) interface that primarily does precisely that.

A suitable storage WITH a TiddlyWeb interface on top of it (this is comparatively easy actually).

In terms of getting it to show anything, mostly flat, sure... in terms of getting the relations and intricacies of fields and templates as on Wikipedia, that's not at all "easy".
 
A suitable lazy loading mechanism within your client TW.  
A proper search mechanism, since the default TW one will not do the job.

I guess both of these go hand in hand. So it would be some yet-to-be-invented server-side search and indexing module doing the heavy lifting.

I am still keen to hear of a compelling reason to do all that for large dumps of Wikipedia, rather than individual articles.

Best wishes, Tobias. 

Richard Smith

Jan 8, 2015, 4:16:40 AM
to tiddl...@googlegroups.com

The first response that pops off my synapses reading such a proposal is:
Why? and again: Why?


Hi Tobias,

My goal is to package information so that it is accessible completely offline (in an environment where there is no possibility of an internet connection). I would like to curate a collection suitable for a target audience of young children, and I am also interested in taking foreign-language materials and using them as a starting point for a personal learning wiki.

The fact that TW is 'stand-alone' is only one of the great things about it. The thing I like most is that it's so easy to (re-)compose content (and also build custom UI), which is why I find it interesting to use it as a container for largish data sets.

One possibility, I guess, would be to adapt the idea of TiddlyClip so that it can be given a list of pages and then fetch the content for all of them in a systematic manner.
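
As a rough sketch of the "list of pages" idea, the Python snippet below only builds the request URLs for the MediaWiki API, several titles per request; it does not fetch anything. The batch size of 50 and the choice of query parameters are assumptions to check against the API documentation, and batched_query_urls is a made-up name.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 50  # assumed per-request title limit; check the API documentation


def batched_query_urls(titles, batch_size=BATCH_SIZE):
    """Return one query URL per batch of page titles."""
    urls = []
    for start in range(0, len(titles), batch_size):
        batch = titles[start:start + batch_size]
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",   # ask for the page wikitext
            "rvslots": "main",
            "format": "json",
            "titles": "|".join(batch),  # several titles are separated by "|"
        }
        urls.append(API_ENDPOINT + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in batched_query_urls(["TiddlyWiki", "Wiki", "HTML"], batch_size=2):
        print(url)
```

Fetching each URL, reading the JSON response and converting the wikitext would be separate steps on top of this.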

Thanks (all) for the useful suggestions

Regards,
Richard




RichShumaker

Jan 9, 2015, 12:11:21 AM
to tiddl...@googlegroups.com
I agree with Tobias: why repeat the effort if something already exists and works?
I can also say that on more than one occasion, when asked 'why did you do that?',
my response has been 'because I can.'

I have worked with larger data sets in HTML TW5 and have not enjoyed it.
So, from personal experience, I would avoid it.
That said, I am about to explore using Node.js and see how I can break my browser.
To be clear, TW is broken if you get a red box of death or some other error from its code.
When you put a 55 MB PDF in a single tiddler, view it, and the browser chokes, that's the browser.

With all of that said, I can honestly say I would like to use TiddlyMap (? - new TaskGraph name) with Wikipedia to visually navigate the data in a different way,
and to see relationships that I may have missed when they were just words.
It would also be interesting to link my current data set into the Wikipedia data sets.
Obviously, you have an area of interest for which Wikipedia may say something like this:
TiddlyWiki is an open-source single page application wiki. A single HTML file contains CSS, JavaScript, and the content. The content is divided into a series of components, or Tiddlers. A user is encouraged to read a TiddlyWiki by following links rather than sequentially scrolling down the page.
and you want to say more but still link into what may already exist.
Hey, that just struck a chord. Why don't we rewrite what we think TiddlyWiki is and then press go on having it on Wikipedia (until someone changes it)?
I read the Wikipedia description that I posted above and thought 'true but lacking'.

Okay, back on track: from my point of view, I agree with Tobias, but I also see why someone would want this. I would use it if it were made.

Rich Shumaker