Programming with mwlib: 101

Norlesh

Jan 6, 2011, 12:22:42 AM

to mwlib

Well, I had noticed a bit of a shortfall in documentation, so I figured I
would make some notes while I worked it out for myself.
Hope this will be useful for any newcomers who want to use mwlib
inside their own scripts.
cheers, Shane Norris

PS if anyone has more notes to add, please share

Programming with mwlib 101.
-------------------------------------------
[the following assumes you have successfully installed mwlib in your
environment already]

1. Unless you already have a local mediawiki set up (I'm still sore
from trying to get en:wikipedia imported onto a local install), you're
going to want to convert your xml dump file into a CDB database.
From the command line:

mw-buildcdb --input=/xml/dump/file.bz2 --output=/my/output/directory

this gives you three files in the specified directory:
wikiidx.cdb - the articles index file.
wikidata.bin - the wikitext for the articles.
wikiconf.txt - config file for mwlib (you will need to update this if
you move the files later).
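
Before moving on, it can help to confirm the conversion actually produced all three files. A minimal sketch in plain Python: it only checks for the file names listed above, and the helper name missing_cdb_files is made up for illustration (it is not part of mwlib).

```python
import os

# File names as listed above; the helper itself is hypothetical, not part of mwlib.
EXPECTED_FILES = ("wikiidx.cdb", "wikidata.bin", "wikiconf.txt")

def missing_cdb_files(output_dir):
    """Return the expected mw-buildcdb output files that are absent from output_dir."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(output_dir, name))]

if __name__ == "__main__":
    missing = missing_cdb_files("/my/output/directory")
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all three files are present")
```

An empty list means everything is in place; anything else names the files to look for.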

2. Next, inside your python code, you're going to want access to the
articles database:

from mwlib import wiki
env = wiki.makewiki('/location/of/wikiconf.txt')  # wiki Environment object

3. now to access the parse tree of an article:

a = env.wiki.getParsedArticle('name-of-article')

The result is a parse tree of nodes (starting with an Article node in
this case).
[see mwlib/parser/nodes.py for the different possible node types along
with their attributes]

4. To walk the tree, iterate over a node: each node yields its
immediate children.

from mwlib.parser import nodes

for section in a:
    if section.__class__ == nodes.Section:
        # first child is the section's caption
        print section.firstchild.asText()
        # ... do stuff for this section

or you can access all its descendants in document order with
node.allchildren():

for child in a.allchildren():
    if child.__class__ == nodes.Text:
        pass # ... do stuff
    elif child.__class__ == nodes.Item:
        pass # ... do other stuff

or, if you prefer, there is a node.filter(..expr..) method:

parts = []
for t in a.filter(lambda x: x.__class__ == nodes.Text or isinstance(x, nodes.Link)):
    if isinstance(t, nodes.Link):
        # link doesn't have any replacement text, so use the target
        if t.firstchild is None:
            parts.append(t.target)
        # otherwise don't print; the replacement text will show up
        # on the next iteration
    else:
        parts.append(t.asText())
# still has some cruft, such as inline references and language link
# content, that needs removing before it's useful for NLP, but you get
# the picture
print ''.join(parts)
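
To make the difference in step 4 between iterating a node and walking all of its descendants concrete, here is a toy sketch that does not use mwlib at all. The Node, Section and Text classes are stand-ins invented for this example, and all_descendants is only an assumption about what a pre-order ("document order") walk looks like; the real classes are in mwlib/parser/nodes.py.

```python
class Node(object):
    """Stand-in for a parse-tree node (hypothetical, not mwlib's class)."""
    def __init__(self, caption="", children=None):
        self.caption = caption
        self.children = list(children or [])

    def __iter__(self):
        # iterating a node yields its immediate children only
        return iter(self.children)

    def all_descendants(self):
        # depth-first, pre-order: each child, then everything below it
        for child in self.children:
            yield child
            for node in child.all_descendants():
                yield node

class Section(Node):
    pass

class Text(Node):
    pass

article = Node("article", [
    Section("intro", [Text("hello"), Text("world")]),
    Text("trailing"),
])

direct = [n.caption for n in article]
everything = [n.caption for n in article.all_descendants()]
print(direct)      # ['intro', 'trailing']
print(everything)  # ['intro', 'hello', 'world', 'trailing']
```

The first list stops at the top level, while the second also reaches the text nested inside the section, in the order it appears.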

from there you just need to figure which nodes you are interested in
and work your magic, good luck!
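
One way to figure out which nodes matter is to count the node types an article actually contains. A minimal sketch, assuming only that you have some iterable of node objects (for example the result of a.allchildren() from step 4). The helper name count_node_types is made up, and the Text and Link classes below are placeholders so the snippet runs without mwlib.

```python
from collections import Counter

def count_node_types(nodes_iter):
    """Tally objects by class name (hypothetical helper, not part of mwlib)."""
    return Counter(node.__class__.__name__ for node in nodes_iter)

# Placeholder classes so the example runs without mwlib installed.
class Text(object):
    pass

class Link(object):
    pass

counts = count_node_types([Text(), Link(), Text()])
for name, n in counts.most_common():
    print("%s: %d" % (name, n))
```

With a real article you would pass a.allchildren() in place of the placeholder list.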
