More ideas on the mediawiki

jamesmi...@googlemail.com

Oct 26, 2009, 12:56:11 PM
to mediawiki-vcs
http://undeletewikipedia.blogspot.com/2009/10/more-work-on-wikipedia.html

More work on the wikipedia
I am going to post my thoughts and ideas about the wikipedia to this blog.

First of all, I would like to propose my idea of a large number of
processors for the wikipedia.
Let's say that every time you edit an article, or look at one, you can
do some local processing of it if you are able.

Rendering from wikisyntax to HTML,
Link checking,
Semantic tagging,
Blame processing of the history,
Indexing of the text,
Translation to other languages.

There are many applications that could be run on the wiki pages,
but often we lack the processing power to do so.
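
Here is a minimal sketch of one such processor, link checking, in
Python. It assumes the wikitext is available as local files, one
<Title>.wiki file per article in an articles/ directory, and that the
set of local titles doubles as the index of known articles; those are
assumptions made for the example, not part of the proposal.

# Rough sketch of a "link checking" local processor.
# Assumes one <Title>.wiki file per article under articles/ (an assumption).
import re
from pathlib import Path

LINK_RE = re.compile(r"\[\[([^\]|#]+)")  # the target part of [[Target|label]] links

def broken_links(page, known_titles):
    """Return link targets in this page that are not in the local title index."""
    text = page.read_text(encoding="utf-8")
    targets = {m.group(1).strip() for m in LINK_RE.finditer(text)}
    return sorted(t for t in targets if t not in known_titles)

if __name__ == "__main__":
    pages = list(Path("articles").glob("*.wiki"))
    known = {p.stem.replace("_", " ") for p in pages}
    for page in pages:
        missing = broken_links(page, known)
        if missing:
            print(page.stem, "->", ", ".join(missing))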

All of this would be shared via a peer-to-peer, distributed network of data.
That would be mediawiki-vcs.

We would not really need a central server. We would first need a
set of computers, each processing a subset of the wiki.

Avery Pennarun

Oct 26, 2009, 1:29:35 PM
to mediaw...@googlegroups.com
On Mon, Oct 26, 2009 at 12:56 PM, jamesmi...@googlemail.com
<jamesmi...@googlemail.com> wrote:
> First of all, I would like to propose my idea of a large number of
> processors for the wikipedia.
> Let's say that every time you edit an article, or look at one, you can
> do some local processing of it if you are able.

Hey James and all,

I've been lurking here for a few days and I'm seeing lots of ideas,
but it seems to me like way too many things to bite off all at once.
For example, this idea of "local processing" of wikipedia requires you
to first have wikipedia data distributed across a shared network, and
I'm pretty sure we're nowhere near that yet.

My basic questions are:

A) What is the smallest possible unit of work we can do in order to
provide something useful that people will want to use?

B) Who are the people in favour of this work, who are opposed, and
what are the most common objections?

I'd like to help out with this project if I can. I've done a fair bit
of work with git, and especially using it for weird things, eg.
http://alumnit.ca/~apenwarr/log/?m=200901#21. However, as with any
project, we need to start with something simple and work our way up.

My suggested first steps would be:

1. Try importing the entire wikipedia history into git and see what
explodes. With that much data, there will surely be explosions,
probably several of them. We can then discuss and experiment with
potential solutions. (First question: is there a wikipedia data dump
available somewhere that we can try this against?) A rough import
sketch for this step follows after step 2.

2. Try making a copy of mediawiki that uses git instead of a SQL
database. Initially, this won't be nearly scalable enough to handle
wikipedia, but we'd learn a lot just by running some small wikis based
on this.
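
To make step 1 a bit more concrete, here is a very rough import sketch
in Python. It assumes a standard MediaWiki XML export (e.g. a
pages-meta-history dump); the file and directory names are placeholders,
and a real import of the full history would want git fast-import or
similar rather than one git commit per revision.

# Rough sketch: replay revisions from a MediaWiki XML export as git commits.
# The dump filename and repo directory are placeholders.
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

REPO = Path("wiki-repo")

def local(tag):
    """Strip the XML namespace, e.g. '{...}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]

def import_dump(dump_path):
    subprocess.run(["git", "init", str(REPO)], check=True)
    title = None
    for event, elem in ET.iterparse(dump_path, events=("end",)):
        if local(elem.tag) == "title":
            title = elem.text
        elif local(elem.tag) == "revision" and title:
            text = next((c.text or "" for c in elem if local(c.tag) == "text"), "")
            stamp = next((c.text or "" for c in elem if local(c.tag) == "timestamp"), "")
            page_file = REPO / (title.replace("/", "_") + ".wiki")
            page_file.write_text(text, encoding="utf-8")
            subprocess.run(["git", "add", page_file.name], cwd=REPO, check=True)
            subprocess.run(["git", "commit", "-q", "--allow-empty",
                            "-m", f"{title} @ {stamp}"], cwd=REPO, check=True)
            elem.clear()  # drop the revision text to keep memory bounded

if __name__ == "__main__":
    import_dump("pages-meta-history.xml")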

Have fun,

Avery

jamesmi...@googlemail.com

Oct 26, 2009, 1:43:06 PM
to mediaw...@googlegroups.com
On Mon, Oct 26, 2009 at 6:29 PM, Avery Pennarun <apen...@gmail.com> wrote:
>
> On Mon, Oct 26, 2009 at 12:56 PM, jamesmi...@googlemail.com
> <jamesmi...@googlemail.com> wrote:
>> First of all, I would like to propose my idea of a large number of
>> processors for the wikipedia.
>> Let's say that every time you edit an article, or look at one, you can
>> do some local processing of it if you are able.
>
> Hey James and all,
>
> I've been lurking here for a few days and I'm seeing lots of ideas,
> but it seems to me like way too many things to bite off all at once.
> For example, this idea of "local processing" of wikipedia requires you
> to first have wikipedia data distributed across a shared network, and
> I'm pretty sure we're nowhere near that yet.
>
> My basic questions are:
>
> A) What is the smallest possible unit of work we can do in order to
> provide something useful that people will want to use?

Well, at the most basic level we need a way to display the current
version of the article as HTML; for some reason GitHub did not do
that for the HTML files I checked in.

>
> B) Who are the people in favour of this work, who are opposed, and
> what are the most common objections?

I don't know yet, but the common objective is clear: to make something
usable and practical.
Ideally a user should be able to get started ASAP with setting up a
local, distributed wikipedia fork.

>
> I'd like to help out with this project if I can.  I've done a fair bit
> of work with git, and especially using it for weird things, eg.
> http://alumnit.ca/~apenwarr/log/?m=200901#21.  However, as with any
> project, we need to start with something simple and work our way up.

I will look into that soon.


>
> My suggested first steps would be:
>
> 1. Try importing the entire wikipedia history into git and see what
> explodes.  With that much data, there will surely be explosions,
> probably several of them.  We can then discuss and experiment with
> potential solutions.  (First question: is there a wikipedia data dump
> available somewhere that we can try this against?)

I have dumps and all that. I have been experimenting with two articles
on GitHub, and the local processing described above is what I have done
with them.

Also, we will never need the full wikipedia, always a subset. We can
pick a category to start with, but I think that it
is unreasonable to want to mirror the whole thing.

> 2. Try making a copy of mediawiki that uses git instead of a SQL
> database.  Initially, this won't be nearly scalable enough to handle
> wikipedia, but we'd learn a lot just by running some small wikis based
> on this.

Yes, I agree.

mike

Avery Pennarun

Oct 26, 2009, 2:05:46 PM
to mediaw...@googlegroups.com
On Mon, Oct 26, 2009 at 1:43 PM, jamesmi...@googlemail.com
<jamesmi...@googlemail.com> wrote:
> On Mon, Oct 26, 2009 at 6:29 PM, Avery Pennarun <apen...@gmail.com> wrote:
>> A) What is the smallest possible unit of work we can do in order to
>> provide something useful that people will want to use?
>
> Well, at the most basic level we need a way to display the current
> version of the article as HTML; for some reason GitHub did not do
> that for the HTML files I checked in.

Well, surely mediawiki already does *that* part, right? What we
really need is the ability for mediawiki to read the wikitext from the
git repo instead of SQL, and then do what it normally does.
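
Sketching just that read path (in Python rather than PHP, and assuming
a one-<Title>.wiki-file-per-article layout in the repo, which is only
an assumption), the lookup could be as small as:

# Rough sketch of the read path: fetch a page's wikitext at a given revision
# from a git repository instead of from SQL. The repo path and file naming
# convention are assumptions, not anything MediaWiki defines.
import subprocess

def get_wikitext(repo, title, rev="HEAD"):
    path = title.replace(" ", "_") + ".wiki"
    result = subprocess.run(["git", "show", f"{rev}:{path}"],
                            cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout

# e.g. print(get_wikitext("wiki-repo", "Kosovo"))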

I'm not sure how github is related to this. Also, checking in the
html output doesn't seem wise; if you then change the formatting that
mediawiki uses to parse wikitext, you'll have to modify all the
checked-in html files. Those html files could be regenerated at any
time from the wikitext, right?

>> B) Who are the people in favour of this work, who are opposed, and
>> what are the most common objections?
>
> I don't know yet, but the common objective is clear: to make something
> usable and practical.
> Ideally a user should be able to get started ASAP with setting up a
> local, distributed wikipedia fork.

What will they do with it?

> Also, we will never need the full wikipedia, always a subset. We can
> pick a category to start with, but I think that it
> is unreasonable to want to mirror the whole thing.

The problem is that with git (and probably many other systems),
pulling a subset out of the VCS can be a bit complicated. And the
easiest way to get a subset of the history is to have the full history
somewhere. So limiting ourselves to just a "subset" initially
(without clearly defining the process for finding that subset) is not
a helpful simplifying assumption. It's an assumption, but it doesn't
make things simpler.

Have fun,

Avery

jamesmi...@googlemail.com

Oct 26, 2009, 5:51:06 PM
to mediaw...@googlegroups.com
On Mon, Oct 26, 2009 at 7:05 PM, Avery Pennarun <apen...@gmail.com> wrote:
>
> On Mon, Oct 26, 2009 at 1:43 PM, jamesmi...@googlemail.com
> <jamesmi...@googlemail.com> wrote:
>> On Mon, Oct 26, 2009 at 6:29 PM, Avery Pennarun <apen...@gmail.com> wrote:
>>> A) What is the smallest possible unit of work we can do in order to
>>> provide something useful that people will want to use?
>>
>> Well, at the most basic level we need a way to display the current
>> version of the article as HTML; for some reason GitHub did not do
>> that for the HTML files I checked in.
>
> Well, surely mediawiki already does *that* part, right?  What we
> really need is the ability for mediawiki to read the wikitext from the
> git repo instead of SQL, and then do what it normally does.

That should be very very simple.

>
> I'm not sure how github is related to this.

http://github.com/h4ck3rm1k3/KosovoWikipedia/
You can see my experiments there.

> Also, checking in the
> html output doesn't seem wise; if you then change the formatting that
> mediawiki uses to parse wikitext, you'll have to modify all the
> checked-in html files.  Those html files could be regenerated at any
> time from the wikitext, right?

Yes, well, we don't need to pre-render them.
My idea is to do that, but it should be optional.


>
>>> B) Who are the people in favour of this work, who are opposed, and
>>> what are the most common objections?
>>
>> I don't know yet, but the common objective is clear: to make something
>> usable and practical.
>> Ideally a user should be able to get started ASAP with setting up a
>> local, distributed wikipedia fork.
>
> What will they do with it?

They should be able to edit the articles, publish them, and also
collaborate on them.

>
>> Also, we will never need the full wikipedia, always a subset. We can
>> pick a catagory to start with, but I think that it
>> is unreasonable to want to mirror the whole thing.
>
> The problem is that with git (and probably many other systems),
> pulling a subset out of the VCS can be a bit complicated.

We have two subsets: the subset of articles, and the subset of revisions.

The articles can be selected by branch name; we can have one branch
per article or group of articles in a context.
The revisions can be fetched per depth.
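
For what it is worth, stock git can already express both subsets; a
rough sketch in Python (the remote URL, branch naming scheme and depth
are all hypothetical):

# Rough sketch: fetch a subset of the wiki -- one branch (a group of articles)
# and only its latest N revisions. The remote URL and branch name are made up.
import subprocess

def fetch_subset(remote, branch, depth=10, dest="subset"):
    subprocess.run(["git", "clone", "--branch", branch, "--single-branch",
                    "--depth", str(depth), remote, dest], check=True)

# e.g. fetch_subset("git://example.org/wikipedia.git", "topics/kosovo", depth=5)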

> And the
> easiest way to get a subset of the history is to have the full history
> somewhere.  So limiting ourselves to just a "subset" initially
> (without clearly defining the process for finding that subset) is not
> a helpful simplifying assumption.  It's an assumption, but it doesn't
> make things simpler.

I have no intention of processing the full wikipedia. It is too huge.
My vision is that groups of people will take care of different
branches that interest them.
We don't need a single repository that contains the entire wikipedia,
but many of them.

Eventually we will find hosts for them all, but in the first step, I
would like to be able to use this tool for contested articles that are
in the middle of edit wars, as a way to resolve conflicts.

I made some videos on all these topics, if you are interested.

Thanks for your input, I hope to start working on the mediawiki side soon.

mike

Avery Pennarun

Oct 26, 2009, 6:22:16 PM
to mediaw...@googlegroups.com
On Mon, Oct 26, 2009 at 5:51 PM, jamesmi...@googlemail.com
<jamesmi...@googlemail.com> wrote:
> On Mon, Oct 26, 2009 at 7:05 PM, Avery Pennarun <apen...@gmail.com> wrote:
>> Well, surely mediawiki already does *that* part, right?  What we
>> really need is the ability for mediawiki to read the wikitext from the
>> git repo instead of SQL, and then do what it normally does.
>
> That should be very very simple.

Excellent :) Do you know how to do it? It's been a long time since I
hacked on mediawiki.

>> I'm not sure how github is related to this.
>
> http://github.com/h4ck3rm1k3/KosovoWikipedia/
> you can see my experiments there.

Ah, okay. I'd definitely say that checking in your interim and
preprocessed files will cause some confusion here, particularly when
looking at the full project history, etc.

>> Also, checking in the
>> html output doesn't seem wise; if you then change the formatting that
>> mediawiki uses to parse wikitext, you'll have to modify all the
>> checked-in html files.  Those html files could be regenerated at any
>> time from the wikitext, right?
>
> Yes, well, we don't need to pre-render them.
> My idea is to do that, but it should be optional.

I've seen pre-rendering done before, though usually it's in a separate
branch. That way you can just delete that branch and start over when
the main history changes, and you don't clutter up the main history
with tons of temporary, easily-regenerated files. This is important
to make git (and presumably other VCSes) operate efficiently.
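
A rough sketch of that kind of throwaway branch, in Python and assuming
the renderer itself already exists somewhere (the branch names are just
placeholders):

# Rough sketch: keep pre-rendered HTML on a disposable branch so the main
# history never carries generated files. Delete and rebuild the branch whenever
# the renderer or the main history changes.
import subprocess

def git(repo, *args, check=True):
    subprocess.run(["git", *args], cwd=repo, check=check)

def rebuild_rendered(repo, branch="rendered", base="master"):
    git(repo, "branch", "-D", branch, check=False)  # fine if it does not exist yet
    git(repo, "checkout", "--orphan", branch)       # new branch with no parents
    # ... run the wikitext-to-HTML renderer over the checkout here (not shown) ...
    git(repo, "add", "-A")
    git(repo, "commit", "-m", "regenerate rendered HTML")
    git(repo, "checkout", base)                     # back to the real history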

>>>> B) Who are the people in favour of this work, who are opposed, and
>>>> what are the most common objections?
>>>
>>> I don't know yet, but the common objective is clear: to make something
>>> usable and practical.
>>> Ideally a user should be able to get started ASAP with setting up a
>>> local, distributed wikipedia fork.
>>
>> What will they do with it?
>
> They should be able to edit the articles, publish them, and also
> collaborate on them.

Sure... but why do they want that? Why can't they just use the
existing wikipedia for this?

I'd like to understand what *doesn't* work well right now, since that
should logically be the thing to work on first.

>>> Also, we will never need the full wikipedia, always a subset. We can
>>> pick a category to start with, but I think that it
>>> is unreasonable to want to mirror the whole thing.
>>
>> The problem is that with git (and probably many other systems),
>> pulling a subset out of the VCS can be a bit complicated.
> We have two subsets: the subset of articles, and the subset of revisions.
>
> The articles can be selected by branch name; we can have one branch
> per article or group of articles in a context.
> The revisions can be fetched per depth.

I expect this will be made a little more complicated by the fact that
when you take a subset of articles, it'll be hard to figure out which
hyperlinks (eg. across categories) are valid and which are linking to
truly nonexistent articles (vs. just not in your subset repository).

> I have no intention of processing the full wikipedia. It is too huge.

How huge? What is the size of the most recent version of every page,
all tarred and gzipped?

> Eventually we will find hosts for them all, but in the first step, I
> would like to be able to use this tool for contested articles that are
> in the middle of edit wars, as a way to resolve conflicts.

If an article is in the middle of an edit war, how does a distributed
VCS help? It sounds like distributed-disconnected operation would
just result in *more* conflicts.

> I made some videos on all these topics, if you are interested.

I prefer to avoid videos since they take a lot longer than reading an
email. They might be valuable to someone else, though.

Have fun,

Avery

jamesmi...@googlemail.com

Oct 26, 2009, 7:43:54 PM
to mediaw...@googlegroups.com
On Mon, Oct 26, 2009 at 11:22 PM, Avery Pennarun <apen...@gmail.com> wrote:

> Sure... but why do they want that?  Why can't they just use the
> existing wikipedia for this?
>

It is pretty simple: because of NPOV (neutral point of view) disputes and other issues.
