How to translate an XPath in a BeautifulSoup tree

Mr.SpOOn

Mar 29, 2009, 8:22:29 AM
to beautifulsoup
Hi,
maybe the subject is not so clear, so I'll explain here what I mean.

For a web site I'm going to create, I need to get information from
different web sites that don't have RSS feeds.

The number of web sites to parse is not fixed, so I can always find a
new one to parse, and they obviously don't have a similar structure,
so I need to build a parser for every site. It's daunting work, so
I'd like to find an easy and fast way to get the elements I need from
the page.

I thought I could use the FireBug extension to find the element I
need. When I click on an element I found out it can give me the xpath,
that is a string like this:

/html/body/div/table[3]/tbody/tr/td[3]/table[3]

Is there a way to use this in BeautifulSoup?
I mean, I have the soup of the page.

I tried something like this:

soup.html.body.div

And this is ok to get to that div. But how can I jump to the third
table under div?
And why if I do:

div = soup.html.body.div
div.nextSibling

it prints out this: u'\n'

Thanks,
bye

Pratik Dam

Mar 30, 2009, 1:28:14 PM
to beauti...@googlegroups.com
These extra newlines can be removed with soup2 = BeautifulSoup(soup.prettify()). You were getting the newline because it is a TextNode.

The XPath syntax you are talking about should work. Can you prettify and then check again?
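
To illustrate the text-node point, here is a minimal sketch. It uses the standard library's xml.dom.minidom instead of BeautifulSoup, only so that it runs without extra installs, and the markup and the helper name next_element_sibling are made up for the example. The whitespace between two tags is a node of its own, so nextSibling lands on it first.

```python
from xml.dom import minidom

# Hypothetical markup: a <div>, a line break, then a <table>.
doc = minidom.parseString("<body><div>first</div>\n<table/></body>")
div = doc.getElementsByTagName("div")[0]

# The immediate sibling is the whitespace text node between the two tags.
sibling = div.nextSibling
print(repr(sibling.data))


def next_element_sibling(node):
    """Return the next sibling that is an element, skipping text nodes."""
    node = node.nextSibling
    while node is not None and node.nodeType != node.ELEMENT_NODE:
        node = node.nextSibling
    return node


print(next_element_sibling(div).tagName)
```

In BeautifulSoup 3 itself, div.findNextSibling('table') should play the role of the helper above, since it looks for the next sibling tag rather than the next node.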

furyu

Apr 8, 2009, 12:31:29 AM
to beautifulsoup
I released XPathEvaluator Extension for BeautifulSoup.
http://furyu-tei.sakura.ne.jp/archives/BSXPath.zip

With BSXPathEvaluator(*), you can use XPath to find elements in
HTML.
(*) A wrapper (subclass) of BeautifulSoup. Requires Python 2.5* and
BeautifulSoup 3.07* or 3.1*

For example:

from BSXPath import BSXPathEvaluator,XPathResult

html = """
<html><head><title>Hello, DOM 3 XPath!</title></head>
<body><h1>Hello, DOM 3 XPath!</h1><p>This is XPathEvaluator Extension
for BeautifulSoup.</p>
<p>This is based on JavaScript-XPath!</p></body>
"""
document = BSXPathEvaluator(html)

result = document.evaluate('//h1/text()[1]',document,None,XPathResult.STRING_TYPE,None)
print result.stringValue
# Hello, DOM 3 XPath!

result = document.evaluate('//h1',document,None,XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,None)
print result.snapshotLength
# 1
print result.snapshotItem(0)
# <h1>Hello, DOM 3 XPath!</h1>

For details, read the comments in BSXPath.py.

Mr.SpOOn

Apr 10, 2009, 1:58:22 PM
to beauti...@googlegroups.com
2009/4/8 furyu <fur...@gmail.com>:

>
> I released XPathEvaluator Extension for BeautifulSoup.
> http://furyu-tei.sakura.ne.jp/archives/BSXPath.zip
>
> With BSXPathEvaluator(*), you can use XPath to find elements from
> HTML.

Thanks :D
I'm gonna try it.

Mr.SpOOn

Apr 14, 2009, 8:15:43 AM
to beautifulsoup
Pratik Dam:
>  These extra newlines can be removed with soup2 =
> BeautifulSoup(soup.prettify()). You were getting the newline because it
> is a TextNode.
>
> The XPath syntax you are talking about should work. Can you prettify and
> then check again?

Sorry, it seems I somehow missed your reply.
Actually I still get the u'\n'.

I did this:

page = urllib2.urlopen("http://www.estragon.it/programma.htm")
soup = BeautifulSoup(page)
soup2 = BeautifulSoup(soup.prettify())

soup.html.body.div.nextSibling
#u'\n'
soup2.html.body.div.nextSibling
#u'\n'

furyu:
>I released XPathEvaluator Extension for BeautifulSoup.
>http://furyu-tei.sakura.ne.jp/archives/BSXPath.zip

Well, this is cool, but I still need to use BeautifulSoup. It would be
cool if I could extract a part of the page using an XPath query and
then be able to work on it with BeautifulSoup.

furyu

Apr 14, 2009, 9:59:55 AM
to beautifulsoup
Hi, Mr.SpOOn

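Try the following:
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.estragon.it/programma.htm")
soup = BeautifulSoup(page)
# Python indexes are zero-based, so [2] corresponds to [3] in the XPath.
result1 = soup.html.body.div.findAll('table')[2].tr.findAll('td')[2].findAll('table')[2]
print result1
# You will get the target table.
# Calling a tag is shorthand for calling findAll() on it.
result2 = soup.html.body.div('table')[2].tr('td')[2]('table')[2]
print result1 == result2
# True (same result)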

Also,
| Well, this is cool, but I still need to use BeautifulSoup. It would be
| cool if I could extract a part of the page using an XPath query and
| then be able to work on it with BeautifulSoup.

BSXPathEvaluator is a subclass of BeautifulSoup, so you can use the same
functions.

Try the following:
import urllib2
from BSXPath import BSXPathEvaluator

page = urllib2.urlopen("http://www.estragon.it/programma.htm")
soup = BSXPathEvaluator(page)
result3 = soup.getFirstItem('/html/body/div/table[3]/tbody/tr/td[3]/table[3]')
print result3
# You will get the target table.
result4 = soup.html.body.div.findAll('table')[2].tr.findAll('td')[2].findAll('table')[2]
print result3 == result4
# True (same result)
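
If you would rather not write those chains by hand for every site, the translation can be sketched roughly as below. This is only a sketch under stated assumptions: it handles nothing but absolute paths made of plain child steps with an optional [n] predicate (the kind Firebug produces), it treats a step without a predicate as "first match", and the names parse_simple_xpath and walk are invented here. To keep it runnable without BeautifulSoup, the demo walks an xml.etree.ElementTree tree wrapped in a made-up <doc> element. Note that findAll() searches all descendants by default, whereas an XPath step such as table[3] counts direct children only, so with BeautifulSoup 3 the callback would probably be node.findAll(tag, recursive=False).

```python
import re
import xml.etree.ElementTree as ET

# One step looks like "table" or "table[3]".
STEP = re.compile(r'^([A-Za-z][\w-]*)(?:\[(\d+)\])?$')


def parse_simple_xpath(xpath):
    """Split an absolute, child-only XPath into (tag, zero_based_index) pairs."""
    steps = []
    for part in xpath.strip('/').split('/'):
        match = STEP.match(part)
        if match is None:
            raise ValueError('unsupported XPath step: %r' % part)
        tag, position = match.groups()
        # XPath positions are 1-based; no predicate is treated as "first match".
        steps.append((tag, int(position) - 1 if position else 0))
    return steps


def walk(root, steps, children_named):
    """Follow the steps from root; children_named(node, tag) lists child tags."""
    node = root
    for tag, index in steps:
        node = children_named(node, tag)[index]
    return node


# Hypothetical document, wrapped in <doc> so that /html is a child step.
doc = ET.fromstring(
    '<doc><html><body><div>'
    '<table id="a"/><table id="b"/><table id="c"/>'
    '</div></body></html></doc>')

steps = parse_simple_xpath('/html/body/div/table[3]')
target = walk(doc, steps, lambda node, tag: [c for c in node if c.tag == tag])
print(steps)
print(target.get('id'))
```

Keeping the parsing separate from the tree walk means the same list of steps can be replayed against whichever parser you end up using, by swapping only the callback.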

stevesteffler

May 26, 2009, 3:24:41 PM
to beautifulsoup

Hello Furyu,

Thanks very much for your work on BSXPath.py, I am using it in a
project and it works great. My development environment was running
Python 2.5 and the server I am uploading my code onto uses Python 2.4,
so I'm getting the following error when trying to import BSXPath:

The code:

from BSXPath import BSXPathEvaluator,XPathResult

results in this error (copied from my test script):

Traceback (most recent call last):
  File "hello.py", line 7, in ?
    import test.py # main CGI functionality in 'my_cgi.py'
  File "/home/hlcc/public_html/handl/test.py", line 7, in ?
    from BSXPath import BSXPathEvaluator,XPathResult
  File "/home/hlcc/public_html/handl/BSXPath.py", line 138
    return u'true' if obj else u'false'
                    ^
SyntaxError: invalid syntax


Got any ideas how to address this?


Regards,
Steve Steffler

furyu

May 27, 2009, 6:09:05 AM
to beautifulsoup
Hello Steve Steffler,

I'm sorry, I developed BSXPath.py on Python 2.5 and did not test it on
other versions.

>   File "/home/hlcc/public_html/handl/BSXPath.py", line 138
>      return u'true' if obj else u'false'
>                      ^
>  SyntaxError: invalid syntax

"true_value if condition else false_value" is a conditional expression,
which is new syntax in Python 2.5, so it causes that error on Python 2.4.

| return u'true' if obj else u'false'

has the same effect as the following

| if obj:
|     return u'true'
| else:
|     return u'false'

and the latter would work on Python 2.4.
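
The same rewrite applies when the conditional expression sits on the right-hand side of an assignment instead of in a return statement. A small sketch of both cases follows; the function names are invented for illustration and are not taken from BSXPath.py. The first function is the 2.5-only form, kept only to check that the rewrites agree with it, so the file as a whole still needs Python 2.5 or later to run.

```python
def to_xpath_boolean_new(obj):
    # Python 2.5+ only: conditional expression.
    return u'true' if obj else u'false'


def to_xpath_boolean_old(obj):
    # Also valid on Python 2.4: plain if/else statement.
    if obj:
        return u'true'
    else:
        return u'false'


def label_old(obj):
    # Assignment form: "label = u'true' if obj else u'false'" becomes:
    if obj:
        label = u'true'
    else:
        label = u'false'
    return label


# Check that all three agree on truthy and falsy inputs.
for value in (0, 1, u'', u'x', None, [], [0]):
    expected = to_xpath_boolean_new(value)
    assert to_xpath_boolean_old(value) == expected
    assert label_old(value) == expected
```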

Steve Steffler

May 28, 2009, 1:19:23 AM
to beauti...@googlegroups.com

Hello,

Thanks for the reply.  I tried replacing all of that syntax throughout the script, and afterwards I was getting this error instead when trying to do a document.getFirstItem() XPath lookup.

  File "hello.py", line 8, in ?
    test.main()
  File "/home/hlcc/public_html/handl-scrape/test.py", line 18, in main
    html_wrapped = "<html>" + str(document.getFirstItem('//select[@name="Page"]')) + "</html>"
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2627, in getFirstItem
    elm=self.evaluate(expr,context,None,XPathResult.FIRST_ORDERED_NODE_TYPE,None).singleNodeValue
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2617, in evaluate
    return self.createExpression(expr,resolver).evaluate(context,type)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2606, in createExpression
    return XPathExpression(expr,resolver)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2537, in __init__
    self.expr=BinaryExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1030, in parse
    expr=UnaryExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1087, in parse
    return UnionExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1132, in parse
    expr=PathExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1246, in parse
    path=PathExpr(FilterExpr.root()) # RootExpr
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1158, in __init__
    self.needContextPosition=filter.needContextPosition
AttributeError: 'FunctionCall' object has no attribute 'needContextPosition'

So that makes me think that the restructuring of the code doesn't exactly work in the same way, or else I made a typo of some sort.

I also found this: http://www.guyrutenberg.com/2007/10/12/conditional-expressions-in-python-24/

Could this be the correct way to do it in 2.4, do you think?

Thanks again,
Steve
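
A note on pre-2.5 workarounds in general, which is an assumption about the kind of idiom such posts tend to describe and not a summary of the linked page: the widely used "cond and a or b" form silently picks b whenever a is falsy, even if cond is true. Wrapping both values in one-element lists avoids that. The helper names below are made up for the sketch, and because they are ordinary functions both values are evaluated before the call, unlike a real conditional expression.

```python
def pick_and_or(cond, a, b):
    # Common pre-2.5 idiom; falls through to b when a is falsy.
    return cond and a or b


def pick_listed(cond, a, b):
    # One-element lists are always truthy, so the right value is chosen.
    return (cond and [a] or [b])[0]


# With a truthy first value the two forms agree.
print(pick_and_or(True, u'yes', u'no'))
print(pick_listed(True, u'yes', u'no'))

# With an empty string as the first value they differ.
print(repr(pick_and_or(True, u'', u'fallback')))
print(repr(pick_listed(True, u'', u'fallback')))
```

A plain if/else statement, as suggested earlier in the thread, avoids the problem altogether and is easier to check by eye.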