Norlesh
Jan 6, 2011, 12:22:42 AM
to mwlib
Well, I had noticed a bit of a shortfall in documentation, so I figured I
would make some notes while I worked it out for myself.
Hope this will be useful for any newcomers who want to use mwlib
inside their own scripts.
cheers, Shane Norris
PS if anyone has more notes to add, please share
Programming with mwlib 101.
-------------------------------------------
[the following assumes you have successfully installed mwlib in your
environment already]
1. Unless you already have a local mediawiki set up (I'm still sore
from trying to get en:wikipedia imported onto a local install) you're
going to want to convert your xml dump files into a CDB database -
from the command line:
mw-buildcdb --input=/xml/dump/file.bz2 --output=/my/output/directory
this gives you three files in the specified directory:
wikiidx.cdb - the articles index file.
wikidata.bin - the wikitext for the articles.
wikiconf.txt - config file for mwlib (you will need to update this if
you move the files later).
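Before moving on, it can help to confirm the build actually produced all three files. Here is a minimal sketch using only the standard library; the helper name missing_cdb_files is made up for illustration, and the file names are simply the ones listed above.

```python
import os

# File names as listed above for the mw-buildcdb output directory.
EXPECTED_FILES = ("wikiidx.cdb", "wikidata.bin", "wikiconf.txt")


def missing_cdb_files(directory):
    """Return the expected file names that are not present in `directory`.

    Hypothetical helper, not part of mwlib.
    """
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(directory, name))]
```

Calling missing_cdb_files('/my/output/directory') should give an empty list if the build went through.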
2. next, inside your python code, you're going to want access to the
articles database:
from mwlib import wiki
env = wiki.makewiki('/location/of/wikiconf.txt')  # wiki Environment object
3. now to access the parse tree of an article:
a = env.wiki.getParsedArticle('name-of-article')
The result is a parse tree of nodes (starting with an Article node in
this case).
[see mwlib/parser/nodes.py for the different possible node types along
with their attributes]
4. to walk the tree, each node iterates over its immediate children.
from mwlib.parser import nodes
for section in a:
    if section.__class__ == nodes.Section:
        # first child is the section's caption
        print(section.firstchild.asText())
        # ... do stuff for this section
or you can access all its descendants in document order with
node.allchildren():
for child in a.allchildren():
    if child.__class__ == nodes.Text:
        pass  # ... do stuff
    elif child.__class__ == nodes.Item:
        pass  # ... do other stuff
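To make the difference between the two kinds of walk concrete without needing a wiki dump, here is a toy sketch. ToyNode and its descendants() method are stand-ins written for this note, NOT mwlib's classes; they only mimic the idea that iterating a node gives its immediate children while a recursive walk gives everything below it in document order. It is worth checking whether mwlib's own allchildren() also yields the node you call it on.

```python
class ToyNode(object):
    """Stand-in for a parse-tree node; NOT mwlib's own class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def __iter__(self):
        # Iterating a node visits its immediate children only.
        return iter(self.children)

    def descendants(self):
        # Depth-first walk over everything below this node, in document order.
        for child in self.children:
            yield child
            for node in child.descendants():
                yield node


tree = ToyNode("Article", [
    ToyNode("Section", [ToyNode("Text-1"),
                        ToyNode("Item", [ToyNode("Text-2")])]),
    ToyNode("Text-3"),
])

direct = [n.name for n in tree]
everything = [n.name for n in tree.descendants()]
print(direct)      # ['Section', 'Text-3']
print(everything)  # ['Section', 'Text-1', 'Item', 'Text-2', 'Text-3']
```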
or, if you prefer, there is a node.filter(..expr..) method.
parts = []
for t in a.filter(lambda x: x.__class__ == nodes.Text
                  or isinstance(x, nodes.Link)):
    if isinstance(t, nodes.Link):
        # link doesn't have any replacement text, so use the target
        if t.firstchild is None:
            parts.append(t.target)
        # otherwise don't append; the replacement text will show up
        # on the next iteration
    else:
        parts.append(t.asText())
# still has some cruft, such as inline references and language link
# content, that needs removing before it's useful for NLP, but you
# get the picture
print(''.join(parts))
from there you just need to figure out which nodes you are interested in
and work your magic, good luck!
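One way to get a feel for which nodes an article contains is to tally the class names in its tree. This is only a sketch: count_node_types is a made-up helper name, and it assumes nothing beyond the allchildren() method used in step 4.

```python
from collections import Counter


def count_node_types(root):
    """Tally the class name of every node yielded by root.allchildren().

    Hypothetical helper, not part of mwlib; `root` is assumed to be a
    parse-tree node such as the article `a` from step 3.
    """
    return Counter(type(node).__name__ for node in root.allchildren())

# Possible usage with the article from step 3:
#     for name, count in count_node_types(a).most_common():
#         print("%s: %d" % (name, count))
```

The most common class names are usually a reasonable place to start reading in mwlib/parser/nodes.py.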