Reverse XPath


disappearedng

Sep 3, 2009, 11:43:10 PM
to beautifulsoup
Dear everyone,

Below I present a getXPath function which produces a query string for a
node, so that a node found by furyu's BSXPath can be mapped to a string
(and back again by BSXPath.evaluate).

I need this function for an app I am building right now. It's pretty
inefficient, but it gets the job done (most of the time). It only works
on BS nodes, nothing else. I was looking for ways to optimize it (for
example, perhaps we could use specific attributes like id; I haven't
added that yet). However, I am not familiar enough with XPath to
optimize this or to design the test cases. Would this community make an
effort to improve it?

def getXPath(node):
    """
    Works only with NODES from BeautifulSoup, nothing else.
    NOTE: it can be tricky to check; you HAVE TO GO OFFLINE to be able
    to navigate and verify properly, since some pages (e.g. cnn.com)
    generate their DOM at run time.
    """
    def sameNameWithinParent(node):
        """
        Returns the 1-based position of the node among siblings within
        its parent that have the same "name" attribute (i.e. what
        findAll('a') would match, for example).
        Note that, by the standards of M$, children begin at 1.
        """
        pos = 1
        for each in node.parent.contents:
            if each == node:
                return pos
            elif hasattr(each, "name"):
                # Ensure we only count tag nodes, not strings
                if each.name == node.name:
                    pos += 1
    # Check the input is a tag node
    if not hasattr(node, "contents"):
        raise Exception("Must be a valid BS node, nothing else")
    l = []  # l will be [(name, pos), (name, pos), ...]
    while node.parent:
        pos = sameNameWithinParent(node)
        l.append((node.name, pos))
        node = node.parent
    l.reverse()  # Bad practice?
    ret = '/' + '/'.join([each[0] + '[' + str(each[1]) + ']'
                          for each in l])
    return ret
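For anyone who wants to try the idea quickly: below is a minimal,
self-contained sketch of the same walk-up approach, written against the
modern bs4 package (the post targets the older BeautifulSoup 3 API, but
the node interface is similar). The sample markup and the "html.parser"
backend are assumptions for illustration only; it also compares siblings
by identity (`is`), since bs4 tags compare equal by value.

```python
from bs4 import BeautifulSoup

def get_xpath(node):
    """Build a /tag[n]/tag[n]/... path by walking up to the root."""
    parts = []
    while node.parent is not None:
        # 1-based position among siblings with the same tag name
        pos = 1
        for sib in node.parent.contents:
            if sib is node:  # identity, not value equality
                break
            if getattr(sib, "name", None) == node.name:
                pos += 1
        parts.append("%s[%d]" % (node.name, pos))
        node = node.parent
    parts.reverse()
    return "/" + "/".join(parts)

html = "<html><body><div><p>one</p><p>two</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")
second_p = soup.find_all("p")[1]
print(get_xpath(second_p))  # -> /html[1]/body[1]/div[1]/p[2]
```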


If anyone has been able to find bugs or optimize it, please let me
know. My email is disapp...@gmail.com
I haven't been able to stress test it, but it runs at about 3 seconds
per 100 queries (average-sized nodes, nothing fancy).
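On the optimization question above: one direction the post hints at
(using id attributes) is to stop climbing at the nearest ancestor that
carries an id, since ids are supposed to be unique in a document, so
//tag[@id='...'] anchors the rest of the path. This is only a hedged
sketch under bs4, with hypothetical names and sample markup, not the
author's implementation:

```python
from bs4 import BeautifulSoup

def get_xpath_with_id(node):
    """Walk up like getXPath, but stop at the nearest node carrying an
    id attribute; //tag[@id='...'] then anchors the whole path."""
    parts = []
    anchored = False
    while node.parent is not None:
        node_id = node.attrs.get("id") if hasattr(node, "attrs") else None
        if node_id:
            parts.append("%s[@id='%s']" % (node.name, node_id))
            anchored = True
            break  # the id uniquely locates this ancestor
        pos = 1  # 1-based position among same-named siblings
        for sib in node.parent.contents:
            if sib is node:
                break
            if getattr(sib, "name", None) == node.name:
                pos += 1
        parts.append("%s[%d]" % (node.name, pos))
        node = node.parent
    parts.reverse()
    return ("//" if anchored else "/") + "/".join(parts)

html = '<html><body><div id="main"><p>one</p><p>two</p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")
print(get_xpath_with_id(soup.find_all("p")[1]))  # -> //div[@id='main']/p[2]
```

This also shortens paths, which may help with the timing figure above,
though I have not benchmarked it.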

disappearedng

Aaron DeVore

Sep 4, 2009, 1:59:14 PM9/4/09
to beauti...@googlegroups.com
Could you put that on some sort of pastebin-like web site? I'm
having a hard time reading it.

Thanks,
Aaron