Hi,
maybe the subject is not very clear, so I'll explain here what I mean.
For a web site I'm going to create, I need to get information from
different web sites that don't have RSS feeds.
The number of web sites to parse is not fixed, so I may always find a
new one to parse, and they obviously don't share a similar structure,
so I need to build a parser for every site. It's daunting work, so
I'd like to find an easy and fast way to get the elements I need from
the page.
I thought I could use the Firebug extension to find the element I
need. I found out that when I click on an element, it can give me the
XPath, which is a string like this:
/html/body/div/table[3]/tbody/tr/td[3]/table[3]
Is there a way to use this in BeautifulSoup?
I mean, I have the soup of the page.
I tried something like this:
soup.html.body.div
That works to get to that div. But how can I jump to the third
table under the div?
And why if I do:
On Sun, Mar 29, 2009 at 5:52 PM, Mr.SpOOn <mr.spoo...@gmail.com> wrote:
> /html/body/div/table[3]/tbody/tr/td[3]/table[3]
> Is there a way to use this in BeautifulSoup?
With BSXPathEvaluator(*), you can use XPath to find elements in
HTML.
(*) A wrapper (subclass) of BeautifulSoup. Requires Python 2.5* and
BeautifulSoup 3.0.7* or 3.1*
For example:
from BSXPath import BSXPathEvaluator,XPathResult
html = """
<html><head><title>Hello, DOM 3 XPath!</title></head>
<body><h1>Hello, DOM 3 XPath!</h1><p>This is XPathEvaluator Extension
for BeautifulSoup.</p>
<p>This is based on JavaScript-XPath!</p></body>
"""
document = BSXPathEvaluator(html)
result = document.evaluate('//h1/text()[1]',document,None,XPathResult.STRING_TYPE,None)
print result.stringValue
# Hello, DOM 3 XPath!
result = document.evaluate('//h1',document,None,XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,None)
print result.snapshotLength
# 1
print result.snapshotItem(0)
# <h1>Hello, DOM 3 XPath!</h1>
Well, this is cool, but I still need to use BeautifulSoup. It would be
cool if I could extract a part of the page using an XPath query and
then be able to work on it with BeautifulSoup.
Try below:
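import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.estragon.it/programma.htm")
soup = BeautifulSoup(page)
# XPath positions start at 1, Python list indexes at 0, so table[3] becomes [2]
result1 = soup.html.body.div.findAll('table')[2].tr.findAll('td')[2].findAll('table')[2]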
print result1
# You will get the target table.
result2 = soup.html.body.div('table')[2].tr('td')[2]('table')[2]
print result1==result2
# True (same result)
Also,
| Well, this is cool, but I still need to use BeautifulSoup. It would be
| cool if I could extract a part of the page using an XPath query and
| then be able to work on it with BeautifulSoup.
BSXPathEvaluator is a subclass of BeautifulSoup, so you can use the
same functions.
Try below:
import urllib2
from BSXPath import BSXPathEvaluator
page = urllib2.urlopen("http://www.estragon.it/programma.htm")
soup = BSXPathEvaluator(page)
result3 = soup.getFirstItem('/html/body/div/table[3]/tbody/tr/td[3]/table[3]')
print result3
# You will get the target table.
result4 = soup.html.body.div.findAll('table')[2].tr.findAll('td')[2].findAll('table')[2]
print result3==result4
# True (same result)
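The two forms above line up because XPath positions count from 1 while Python lists count from 0, so table[3] in the XPath becomes findAll('table')[2]. Here is a rough sketch of doing that conversion mechanically for simple Firebug-style paths. The helper names xpath_steps and walk are made up for this example and are not part of BeautifulSoup or BSXPath. It only handles plain tag[n] steps, and it assumes a tbody step may be missing from the page source because browsers add tbody to the DOM themselves.

```python
import re

def xpath_steps(xpath):
    """Split a simple absolute XPath into (tag, zero_based_index) pairs."""
    steps = []
    for part in xpath.strip('/').split('/'):
        match = re.match(r'^(\w+)(?:\[(\d+)\])?$', part)
        if match is None:
            raise ValueError('unsupported XPath step: %r' % part)
        tag, position = match.group(1), match.group(2)
        # A step without [n] means the first matching child
        steps.append((tag, int(position) - 1 if position else 0))
    return steps

def walk(node, steps):
    """Follow the steps through any object with a BeautifulSoup-style findAll()."""
    for tag, index in steps:
        # recursive=False looks at direct children only, like "/" in XPath
        children = node.findAll(tag, recursive=False)
        if tag == 'tbody' and not children:
            # Assumption: the browser inserted this tbody, the source has none
            continue
        node = children[index]
    return node

print(xpath_steps('/html/body/div/table[3]/tbody/tr/td[3]/table[3]'))
```

With a soup object in hand, walk(soup, xpath_steps(path)) should land on the same element as the hand-written chain, provided the page really has that structure. Anything beyond plain tag[n] steps (attributes, //, functions) is a job for BSXPathEvaluator.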
Thanks very much for your work on BSXPath.py. I am using it in a
project and it works great. My development environment runs
Python 2.5, but the server I am uploading my code to uses Python 2.4,
so I'm getting the following error when trying to import BSXPath:
The code:
from BSXPath import BSXPathEvaluator,XPathResult
Produces this traceback (copied from my test script):
Traceback (most recent call last):
  File "hello.py", line 7, in ?
    import test.py # main CGI functionality in 'my_cgi.py'
  File "/home/hlcc/public_html/handl/test.py", line 7, in ?
    from BSXPath import BSXPathEvaluator,XPathResult
  File "/home/hlcc/public_html/handl/BSXPath.py", line 138
    return u'true' if obj else u'false'
                    ^
SyntaxError: invalid syntax
Got any ideas how to address this?
Regards,
Steve Steffler
On Apr 14, 7:59 am, furyu <fury...@gmail.com> wrote:
> BSXPathEvaluator is sub-class of BeautifulSoup, so you can use same
> functions.
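The SyntaxError in the traceback above points at a conditional expression (X if C else Y). That syntax was added in Python 2.5, so Python 2.4 cannot parse it. A minimal sketch of a 2.4-compatible rewrite of that one line follows. The function name to_boolean_string is made up for illustration, because the traceback does not show which function in BSXPath.py contains the line.

```python
def to_boolean_string(obj):
    # Python 2.5+ original: return u'true' if obj else u'false'
    # A plain if statement behaves the same and also parses on Python 2.4
    if obj:
        return u'true'
    return u'false'

print(to_boolean_string([1]))
print(to_boolean_string([]))
```

Every other conditional expression in the file would need the same treatment, since Python 2.4 stops at the first one it meets.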
Thanks for the reply. I replaced that syntax throughout the script, and
now I get this error instead when trying to do a
document.getFirstItem() XPath lookup:
  File "hello.py", line 8, in ?
    test.main()
  File "/home/hlcc/public_html/handl-scrape/test.py", line 18, in main
    html_wrapped = "<html>" + str(document.getFirstItem('//select[@name="Page"]')) + "</html>"
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2627, in getFirstItem
    elm=self.evaluate(expr,context,None,XPathResult.FIRST_ORDERED_NODE_TYPE,None).singleNodeValue
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2617, in evaluate
    return self.createExpression(expr,resolver).evaluate(context,type)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2606, in createExpression
    return XPathExpression(expr,resolver)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2537, in __init__
    self.expr=BinaryExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1030, in parse
    expr=UnaryExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1087, in parse
    return UnionExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1132, in parse
    expr=PathExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1246, in parse
    path=PathExpr(FilterExpr.root()) # RootExpr
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1158, in __init__
    self.needContextPosition=filter.needContextPosition
AttributeError: 'FunctionCall' object has no attribute 'needContextPosition'
So I suspect that either the restructured code doesn't behave exactly
like the original, or I made a typo somewhere.
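The traceback alone does not show which replacement went wrong, so this is only one thing worth checking. If any of the conditional expressions were rewritten with the older "cond and a or b" idiom, the result differs from "a if cond else b" whenever a is falsy (0, an empty string, an empty list, None, False). A small sketch of the difference, with made-up helper names:

```python
def pick_and_or(cond, a, b):
    # Pre-2.5 idiom: falls through to b whenever a itself is falsy
    return cond and a or b

def pick_if_else(cond, a, b):
    # Statement form: same result as "a if cond else b" on Python 2.5+
    if cond:
        return a
    return b

# Compare both forms with cond=True for truthy and falsy values of a
for a in (1, 'text', 0, '', [], None, False):
    same = pick_and_or(True, a, 'fallback') == pick_if_else(True, a, 'fallback')
    print('%r: same result = %s' % (a, same))
```

The if/else statement form is the safer rewrite, at the cost of a few more lines per replaced expression.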