I have now used Apertium to process the article revisions, as an example of applying natural language processing to the individual versions. In addition, an HTML rendering is checked in. With this in place, we could have a process that checks out the latest versions and renders them, in a distributed manner.
There is no reason to render immediately; people can render on check-in of the XML and check in the HTML as well. Editors should be able to edit in a WYSIWYG manner, ideally with OpenOffice.
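As a minimal sketch of that render-on-check-in idea (the paths and the commit message are illustrative, not the repository's actual layout; convert.pl is the wrapper described below):

# render every checked-in .xml with convert.pl, check the HTML in alongside
for f in Wiki/Kosovo/*.xml; do
    perl convert.pl "$f" > "$f.html"
done
git add Wiki/Kosovo/*.html
git commit -m "render html for latest revisions"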
I have checked in a new tool, histproc, which will visit each revision in order, restore it from patch, and let you process it and the previous version with some tool (Apertium, for example); the results are then all checked in. That will allow you to plug any tool into the processing.
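For illustration, here is roughly what histproc does, written as a shell loop over the git history rather than the patch files the real tool reads; process.sh is the script from this post, the loop itself is just a sketch:

# restore each revision in order, process it, and check the results in
prev=
for rev in $(git rev-list --reverse HEAD -- article.xml); do
    git show "$rev:article.xml" > current.xml          # restore this revision
    [ -n "$prev" ] && git show "$prev:article.xml" > previous.xml
    sh process.sh current                              # or any other tool
    git add current.*
    git commit -m "processed revision $rev"
    prev=$rev
done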
The next types of processing I would like to add are:
1. Parse trees of the sentences, finding the parts of each sentence.
2. Extracting all the nouns and verbs into a unique list for each revision; that would let you see where a name was introduced for the first time, or which name was rejected and is not in the current revision (see the sketch after this list).
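A sketch of item 2, assuming the Apertium stream format the tagger emits, where each lexical unit looks like ^lemma<tags>$ and, in the en-es pair, <n> marks nouns and <vblex> marks lexical verbs:

# pull the noun and verb lemmas out of one tagged revision, uniquely
grep -Eo '\^[^$]*\$' article.xml.html.tag \
  | grep -E '<(n|vblex)>' \
  | sed -e 's/^\^//' -e 's/<.*//' \
  | sort -u > rev1.words

Comparing two such lists then shows exactly where a name first appeared:

comm -13 rev1.words rev2.words    # lemmas in rev2 that are not in rev1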
The next steps in processing would be to start experimenting with:
1. A recursive category extraction tool, one to get all the articles in a category and in all its subcategories (sketched below).
2. The ability to import all changes into the git repository daily.
3. Setting up a git repository for a larger set of articles.
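Here is a first sketch of the category extraction against the MediaWiki API; list=categorymembers, cmtype and cmlimit are real API parameters, while the helper name and the en.wikipedia endpoint are just for illustration:

API='http://en.wikipedia.org/w/api.php'
# print the pages in a category, then recurse into its subcategories
fetch_category () {
    curl -s "$API?action=query&list=categorymembers&cmtitle=Category:$1&cmtype=page&cmlimit=500&format=xml" \
      | grep -o 'title="[^"]*"' | sed -e 's/^title="//' -e 's/"$//'
    curl -s "$API?action=query&list=categorymembers&cmtitle=Category:$1&cmtype=subcat&cmlimit=500&format=xml" \
      | grep -o 'title="Category:[^"]*"' \
      | sed -e 's/title="Category://' -e 's/"$//' -e 's/ /_/g' \
      | while read -r sub; do fetch_category "$sub"; done
}
fetch_category Kosovo

A real tool would also have to remember the categories it has already visited (the category graph has cycles) and follow the API's continuation parameter past the 500-member limit.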
Here are the outputs of the processing; each revision is checked in:
http://github.com/h4ck3rm1k3/KosovoWikipedia/tree/master/Wiki/Kosovo/processing/
This uses histproc, which lets you visit each version and check the results in:
http://bazaar.launchpad.net/%7Ejamesmikedupont/introspectorreader/wikipedia-strategy/annotate/head%3A/apertium/histproc.pl
process.sh is used to process the versions:
http://bazaar.launchpad.net/%7Ejamesmikedupont/introspectorreader/wikipedia-strategy/annotate/head%3A/apertium/process.sh
* article.xml.html
* article.xml.html.1
* article.xml.html.am
* article.xml.html.tag
* article.xml.html.pre
* article.xml.html.fin
* article.xml._html
1. First we convert the XML to HTML, using XHTML::MediaWiki:
perl /home/mdupont/2009/10/strategyl/wikipedia-strategy/convert.pl $1.xml > $1.html
2. Then apertium-deshtml is run to produce the .1 file:
apertium-deshtml $1.html > $1.html.1
3. Then lt-proc is called with the morphological analyser (automorf):
cat $1.html.1 | /usr/bin/lt-proc /usr/share/apertium/apertium-en-es/en-es.automorf.bin > $1.html.am
4. Then we tag the file:
cat $1.html.am | /usr/bin/apertium-tagger -g /usr/share/apertium/apertium-en-es/en-es.prob > $1.html.tag
5. Then apertium-pretransfer is called:
cat $1.html.tag | /usr/bin/apertium-pretransfer > $1.html.pre
6. Then we break the stream into lines, one per lexical unit (each ends in $):
cat $1.html.pre | perl -p -n -e's;\$;\$\n;g' > $1.html.fin
7. And finally apertium-rehtml reproduces valid HTML:
apertium-rehtml $1.html.fin > $1_.html
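Each of the Apertium tools reads standard input when no input file is given, so the whole chain can also be written as one pipeline; a sketch with the same paths as above (process.sh keeps the intermediate files instead, which is handy for checking each stage in):

D=/usr/share/apertium/apertium-en-es
apertium-deshtml $1.html \
  | lt-proc $D/en-es.automorf.bin \
  | apertium-tagger -g $D/en-es.prob \
  | apertium-pretransfer \
  | perl -p -n -e's;\$;\$\n;g' \
  | apertium-rehtml > $1_.html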
This has been posted here:
http://fmtyewtk.blogspot.com/2009/10/git-push-origin-master.html