Overview of the work done

jamesmi...@googlemail.com
Oct 22, 2009, 2:37:18 AM
to mediawiki-vcs

Additionally, I did not name this group mediawiki-git, but mediawiki-vcs, because we should support multiple back ends via a plug-in. I am interested in using git because I think git is great, but others should be free to use CVS if they feel it is needed.

Here is a collection of links about this idea (Wikipedia on git):

See the thread on foundation-l that spawned this group:
http://www.mail-archive.com/founda...@lists.wikimedia.org/msg08342.html

Here are some of my ideas from that thread:

1. Very few people will want to have all the data; getting all the versions from all the git repositories would be just as large. My idea is for smaller chapters, towns, or regions that want to get started easily to host their own branches of relevant data. Given a world full of such servers, the sum would be great, but the individual branches needed at any one time would be small.

2. I am using git because I think it is the best way forward to implement many of the ideas discussed in the strategy wiki.

3. If you want only the last 3 revisions checked out, a shallow clone takes 1.258s and produces 252K of data:

git clone --depth 3 git://github.com/h4ck3rm1k3/KosovoWikipedia.git
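
If more history is needed later, a shallow clone like this can be deepened; the depth value here is only an example:

git fetch --depth 100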


Here are two video messages I created:
http://www.youtube.com/watch?v=jc9jo1ZFLqk

http://www.youtube.com/watch?v=7WfRuEuvIso

Here is an example article in git:
http://github.com/h4ck3rm1k3/KosovoWikipedia/blame/master/Wiki/Kosovo/article.xml

Here is my source code; I am adding code for WordNet, MySpell, and other sources as well.

https://code.launchpad.net/~jamesmikedupont/+junk/wikiatransfer

Here is an older discussion back in 2008:
http://www.gossamer-threads.com/lists/wiki/foundation/121420

Here is the mail to the git list:
http://fmtyewtk.blogspot.com/2009/10/mail-to-git-list.html


Here are some blog posts from me:

http://fmtyewtk.blogspot.com/2009/10/wikipedia-meets-git.html
http://fmtyewtk.blogspot.com/2009/10/mail-to-git-list.html
http://fmtyewtk.blogspot.com/2009/10/gitblame-hack-for-char-by-char-diffs.html
http://fmtyewtk.blogspot.com/2009/10/idea-about-git-blame.html
http://fmtyewtk.blogspot.com/2009/10/there-are-few-wikis-built-on-top-of-git.html

jamesmi...@googlemail.com
Oct 25, 2009, 7:37:54 AM
to mediawiki-vcs
I have now used apertium to process the article revisions, as an example of applying natural language processing to the individual versions. In addition, an HTML rendering is checked in. With this in place, we could have a process that checks out the latest versions and renders them, in a distributed manner.

There is no reason to render immediately; people can render on check-in of the XML and check in the HTML as well. Editors should be able to edit in a WYSIWYG manner, ideally with OpenOffice.

I have checked in a new tool, histproc, which will visit each revision in order, restore it from a patch, and let you process it and the previous version with some tool (apertium, for example); the results are then all checked in.

That will allow you to apply any tool in the processing, along the lines of the sketch below.
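
Purely as an illustration of that idea (histproc.pl itself is linked below), the revision walk could look roughly like this, where TOOL stands for any processor that takes the current and previous version as arguments:

#!/bin/sh
# Visit every revision of the article oldest-first and hand the current
# and previous version to an arbitrary processor, then check the results in.
FILE=Wiki/Kosovo/article.xml
TOOL=./mytool.sh                 # hypothetical processor
PREV=""
for REV in $(git rev-list --reverse HEAD -- "$FILE"); do
    git show "$REV:$FILE" > current.xml                  # restore this revision
    [ -n "$PREV" ] && git show "$PREV:$FILE" > previous.xml
    "$TOOL" current.xml previous.xml "$REV"              # process the pair
    PREV="$REV"
done
git add . && git commit -m "processed all revisions"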

The next types of processing I would like to add are:

1. Parsing trees of the sentences, finding the parts of the sentences.
2. Extracting all the nouns and verbs into a unique list for each revision; that would allow you to see where a name was introduced for the first time, or which name was rejected and is not in the current revision (see the sketch after this list).
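
As a rough illustration of item 2, one could start from the tagged output produced in step 4 of the pipeline below; the <n>/<np>/<vblex> tag names are the usual apertium ones and are an assumption here:

# collect a sorted, unique list of noun, proper-noun and verb lemmas
# for one revision of the article
grep -o '\^[^$]*\$' article.xml.html.tag \
  | grep -E '<n>|<np>|<vblex>' \
  | sed -e 's/^\^//' -e 's/<.*//' \
  | sort -u > rev_new.words

# after doing the same for the previous revision (rev_old.words):
comm -13 rev_old.words rev_new.words    # words introduced in the new revision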

The next steps in processing would be to start experimenting with:
1. A recursive category extraction tool, to get all the articles in a category and all of its subcategories (a sketch follows this list).
2. The ability to import all changes into the git repository daily.
3. Setting up a git repository for a larger set of articles.
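
For item 1, a first cut could lean on the MediaWiki API's list=categorymembers query. The sketch below is only an outline; it has no continuation or cycle handling, and category names containing spaces would need URL encoding:

#!/bin/sh
# Recursively list the members of a category and its subcategories
API="http://en.wikipedia.org/w/api.php"

list_category () {
    CAT="$1"
    curl -s "$API?action=query&list=categorymembers&cmtitle=Category:$CAT&cmlimit=500&format=xml" \
        > "members-$CAT.xml"
    # members with title="Category:..." are subcategories; recurse into them
    grep -o 'title="Category:[^"]*"' "members-$CAT.xml" \
        | sed -e 's/^title="Category://' -e 's/"$//' \
        | while read SUB; do
            list_category "$SUB"
          done
}

list_category Kosovo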


Here are the outputs of the processing; each revision is checked in:
http://github.com/h4ck3rm1k3/KosovoWikipedia/tree/master/Wiki/Kosovo/processing/

This was done with the histproc tool, which lets you visit each version and check in the results:

http://bazaar.launchpad.net/%7Ejamesmikedupont/introspectorreader/wikipedia-strategy/annotate/head%3A/apertium/histproc.pl

The process.sh script is used to process the versions:

http://bazaar.launchpad.net/%7Ejamesmikedupont/introspectorreader/wikipedia-strategy/annotate/head%3A/apertium/process.sh

* article.xml.html
* article.xml.html.1
* article.xml.html.am
* article.xml.html.tag
* article.xml.html.pre
* article.xml.html.fin
* article.xml._html

1. First we convert the XML to HTML, using xhtml::mediawiki,

perl /home/mdupont/2009/10/strategyl/wikipedia-strategy/convert.pl $1.xml > $1.html

2. Then apertium-deshtml is run to produce the .1 file,

apertium-deshtml $1.html > $1.html.1

3. Then lt-proc for automorf is called,

cat $1.html.1 | /usr/bin/lt-proc /usr/share/apertium/apertium-en-es/en-es.automorf.bin > $1.html.am

4. Then we tag the file,

cat $1.html.am | /usr/bin/apertium-tagger -g /usr/share/apertium/apertium-en-es/en-es.prob > $1.html.tag

5. Then apertium-pretransfer is called,

cat $1.html.tag | /usr/bin/apertium-pretransfer > $1.html.pre

6. Then we break the output into lines,

cat $1.html.pre | perl -p -n -e's;\$;\$\n;g' > $1.html.fin

7. And finally we reproduce valid HTML,

apertium-rehtml $1.html.fin > $1_.html
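
Put together, the seven steps above amount to something like the following process.sh (reconstructed from the commands as written; the convert.pl path is the author's local one):

#!/bin/sh
# $1 is the article base name; step 1 expects the source as $1.xml
perl /home/mdupont/2009/10/strategyl/wikipedia-strategy/convert.pl $1.xml > $1.html
apertium-deshtml $1.html > $1.html.1
cat $1.html.1 | /usr/bin/lt-proc /usr/share/apertium/apertium-en-es/en-es.automorf.bin > $1.html.am
cat $1.html.am | /usr/bin/apertium-tagger -g /usr/share/apertium/apertium-en-es/en-es.prob > $1.html.tag
cat $1.html.tag | /usr/bin/apertium-pretransfer > $1.html.pre
cat $1.html.pre | perl -p -n -e 's;\$;\$\n;g' > $1.html.fin
apertium-rehtml $1.html.fin > $1_.html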

This has been posted here:
http://fmtyewtk.blogspot.com/2009/10/git-push-origin-master.html