How to translate a xpath in a BeautifulSoup tree
From: "Mr.SpOOn" <mr.spoo...@gmail.com>
Date: Sun, 29 Mar 2009 05:22:29 -0700 (PDT)
Local: Sun, Mar 29 2009 8:22 am
Subject: How to translate a xpath in a BeautifulSoup tree
Hi,
maybe the subject is not so clear so I'll explain here what I mean.

For a web site I'm going to create I need to get information from
different web sites that don't have RSS feeds.

The number of websites to parse is not fixed, so I can always find a
new one to parse, and they obviously don't share a similar structure,
so I need to build a parser for every site. It's daunting work, so
I'd like to find an easy and fast way to get the elements I need from
the page.

I thought I could use the Firebug extension to find the element I
need. When I click on an element, I found out it can give me the XPath,
which is a string like this:

/html/body/div/table[3]/tbody/tr/td[3]/table[3]

Is there a way to use this in BeautifulSoup?
I mean, I have the soup of the page.

I tried something like this:

soup.html.body.div

And this is ok to get to that div. But how can I jump to the third
table under div?
And why if I do:

div = soup.html.body.div
div.nextSibling

it prints out this:  u'\n'

Thanks,
bye


 
From: Pratik Dam <prat...@gmail.com>
Date: Mon, 30 Mar 2009 22:58:14 +0530
Local: Mon, Mar 30 2009 1:28 pm
Subject: Re: How to translate a xpath in a BeautifulSoup tree

These extra newlines can be removed with soup2 =
BeautifulSoup(soup.prettify()). You were getting the newline because it
is a TextNode.

The XPath syntax you are talking about should work. Can you prettify and
then check again?
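
The text-node point can be tried out without the original page. The sketch below is only an illustration: it uses the standard-library xml.dom.minidom instead of BeautifulSoup, it is written for Python 3, the sample markup is made up, and the helper name next_element is hypothetical. It shows that the line break between two tags is itself a sibling node, and that reaching the next tag means skipping over it.

```python
from xml.dom import minidom

# Made-up sample markup: a line break separates </div> from the first <table/>.
SAMPLE = "<body><div>first</div>\n<table/>\n<table/></body>"

doc = minidom.parseString(SAMPLE)
div = doc.documentElement.firstChild  # the <div> element

# The immediate sibling is the whitespace text node, not the <table>.
sibling = div.nextSibling
print(repr(sibling.data))  # '\n'


def next_element(node):
    """Return the next sibling that is an element, skipping text nodes."""
    node = node.nextSibling
    while node is not None and node.nodeType != node.ELEMENT_NODE:
        node = node.nextSibling
    return node


print(next_element(div).tagName)  # table
```

In BeautifulSoup 3 the same skip is available through findNextSibling, which looks for the next sibling tag rather than the next sibling node.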


 
From: furyu <fury...@gmail.com>
Date: Tue, 7 Apr 2009 21:31:29 -0700 (PDT)
Local: Wed, Apr 8 2009 12:31 am
Subject: Re: How to translate a xpath in a BeautifulSoup tree
I released an XPathEvaluator extension for BeautifulSoup.
http://furyu-tei.sakura.ne.jp/archives/BSXPath.zip

With BSXPathEvaluator(*), you can use XPath to find elements in
HTML.
(*) A wrapper (subclass) of BeautifulSoup. Requires Python 2.5* and
BeautifulSoup 3.07* or 3.1*

For example:

from BSXPath import BSXPathEvaluator,XPathResult

html = """
<html><head><title>Hello, DOM 3 XPath!</title></head>
<body><h1>Hello, DOM 3 XPath!</h1><p>This is XPathEvaluator Extension
for BeautifulSoup.</p>
<p>This is based on JavaScript-XPath!</p></body>
"""
document = BSXPathEvaluator(html)

result = document.evaluate('//h1/text()[1]',document,None,XPathResult.STRING_TYPE,None)
print result.stringValue
# Hello, DOM 3 XPath!

result = document.evaluate('//h1',document,None,XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,None)
print result.snapshotLength
# 1
print result.snapshotItem(0)
# <h1>Hello, DOM 3 XPath!</h1>

For details, read the comments in BSXPath.py.


 
From: "Mr.SpOOn" <mr.spoo...@gmail.com>
Date: Fri, 10 Apr 2009 19:58:22 +0200
Local: Fri, Apr 10 2009 1:58 pm
Subject: Re: How to translate a xpath in a BeautifulSoup tree
2009/4/8 furyu <fury...@gmail.com>:

> I released XPathEvaluator Extension for BeautifulSoup.
> http://furyu-tei.sakura.ne.jp/archives/BSXPath.zip

> With BSXPathEvaluator(*), you can use XPath to find elements from
> HTML.

Thanks :D
I'm gonna try it.

 
From: "Mr.SpOOn" <mr.spoo...@gmail.com>
Date: Tue, 14 Apr 2009 05:15:43 -0700 (PDT)
Local: Tues, Apr 14 2009 8:15 am
Subject: Re: How to translate a xpath in a BeautifulSoup tree
Pratik Dam:

> These extra newlines can be removed with soup2 =
> BeautifulSoup(soup.prettify()). You were getting the newline because it
> is a TextNode.

> The XPath syntax you are talking about should work. Can you prettify and
> then check again?

Sorry, it seems I somehow missed your reply.
Actually I still get the u'\n'.

I did this:

page = urllib2.urlopen("http://www.estragon.it/programma.htm")
soup = BeautifulSoup(page)
soup2 = BeautifulSoup(soup.prettify())

soup.html.body.div.nextSibling
#u'\n'
soup2.html.body.div.nextSibling
#u'\n'

furyu:

>I released XPathEvaluator Extension for BeautifulSoup.
>http://furyu-tei.sakura.ne.jp/archives/BSXPath.zip

Well, this is cool, but I still need to use BeautifulSoup. It would be
cool if I could extract a part of the page using an XPath query and
then be able to work on it with BeautifulSoup.

 
From: furyu <fury...@gmail.com>
Date: Tue, 14 Apr 2009 06:59:55 -0700 (PDT)
Local: Tues, Apr 14 2009 9:59 am
Subject: Re: How to translate a xpath in a BeautifulSoup tree
Hi, Mr.SpOOn

Try below:
page = urllib2.urlopen("http://www.estragon.it/programma.htm")
soup = BeautifulSoup(page)
result1 = soup.html.body.div.findAll('table')[2].tr.findAll('td')[2].findAll('table')[2]
print result1
# You will get the target table.
result2 = soup.html.body.div('table')[2].tr('td')[2]('table')[2]
print result1==result2
# True (same result)

Also,
| Well, this is cool, but I still need to use BeautifulSoup. It would be
| cool if I could extract a part of the page using an XPath query and
| then be able to work on it with BeautifulSoup.

BSXPathEvaluator is a subclass of BeautifulSoup, so you can use the same
functions.

Try below:
page = urllib2.urlopen("http://www.estragon.it/programma.htm")
soup = BSXPathEvaluator(page)
result3 = soup.getFirstItem('/html/body/div/table[3]/tbody/tr/td[3]/table[3]')
print result3
# You will get the target table.
result4 = soup.html.body.div.findAll('table')[2].tr.findAll('td')[2].findAll('table')[2]
print result3==result4
# True (same result)
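
The mapping used above, where table[3] in the XPath becomes findAll('table')[2] in Python, comes from XPath positions being 1-based while Python list indices are 0-based. The sketch below shows the same correspondence on a small scale. It is only an illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree (which supports a limited XPath subset) instead of BeautifulSoup or BSXPath, it is written for Python 3, and the sample markup is made up and well-formed, unlike most real pages.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (not the page from this thread).
SAMPLE = """<html><body><div>
<table id="t1"><tr><td>a</td></tr></table>
<table id="t2"><tr><td>b</td></tr></table>
<table id="t3"><tr><td>c1</td><td>c2</td><td>c3</td></tr></table>
</div></body></html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# XPath positions are 1-based: table[3] is the third <table> child of <div>.
by_path = root.find("./body/div/table[3]/tr/td[3]")

# The same walk, step by step, with 0-based Python indexing.
div = root.find("body").find("div")
by_index = div.findall("table")[2].find("tr").findall("td")[2]

print(by_path is by_index)  # True
print(by_path.text)         # c3
```

One difference to keep in mind: ElementTree's findall("table") only looks at direct children, as the XPath child step does, whereas BeautifulSoup 3's findAll searches all descendants by default, so the two chains can diverge on pages with nested tables.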


 
From: stevesteffler <stevesteff...@gmail.com>
Date: Tue, 26 May 2009 12:24:41 -0700 (PDT)
Local: Tues, May 26 2009 3:24 pm
Subject: Re: How to translate a xpath in a BeautifulSoup tree

Hello Furyu,

Thanks very much for your work on BSXPath.py. I am using it in a
project and it works great.  My development environment was running
Python 2.5 and the server I am uploading my code onto uses Python 2.4,
so I'm getting the following error when trying to import BSXPath:

The code:

from BSXPath import BSXPathEvaluator,XPathResult

Results in this traceback (copied from my test script):

Traceback (most recent call last):
  File "hello.py", line 7, in ?
    import test.py  # main CGI functionality in 'my_cgi.py'
  File "/home/hlcc/public_html/handl/test.py", line 7, in ?
    from BSXPath import BSXPathEvaluator,XPathResult
  File "/home/hlcc/public_html/handl/BSXPath.py", line 138
     return u'true' if obj else u'false'
                     ^
 SyntaxError: invalid syntax

Got any ideas how to address this?

Regards,
Steve Steffler



 
From: furyu <fury...@gmail.com>
Date: Wed, 27 May 2009 03:09:05 -0700 (PDT)
Local: Wed, May 27 2009 6:09 am
Subject: Re: How to translate a xpath in a BeautifulSoup tree
Hello Steve Steffler,

I'm sorry, I developed BSXPath.py on Python 2.5 and did not test it on
others.

>   File "/home/hlcc/public_html/handl/BSXPath.py", line 138
>      return u'true' if obj else u'false'
>                      ^
>  SyntaxError: invalid syntax

"true_value if condition else false_value" is a conditional expression,
new syntax added in Python 2.5 (PEP 308), so it causes that error on
Python 2.4.

|      return u'true' if obj else u'false'

has the same effect as the following

| if obj:
|   return u'true'
| else:
|   return u'false'

and the latter would work on Python 2.4.
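
The equivalence can be checked directly. The sketch below wraps the if/else form in a function and compares it with the one-line form over a handful of truthy and falsy values. The function name xpath_boolean_string is hypothetical (it is not taken from BSXPath.py), and the comparison line itself needs Python 2.5 or later, so it belongs on the development machine, not on the 2.4 server.

```python
def xpath_boolean_string(obj):
    """if/else spelling of: u'true' if obj else u'false'"""
    if obj:
        return u'true'
    else:
        return u'false'


# Compare against the conditional expression for truthy and falsy inputs.
samples = [0, 1, "", "x", [], [0], None]
for value in samples:
    assert xpath_boolean_string(value) == (u'true' if value else u'false')

print([xpath_boolean_string(v) for v in samples])
```

Note that the rewrite turns one expression into a four-line statement, so it only drops in cleanly where the original sat in a return or a plain assignment; a conditional expression nested inside a larger expression needs a temporary variable first.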


 
From: Steve Steffler <stevesteff...@gmail.com>
Date: Wed, 27 May 2009 23:19:23 -0600
Local: Thurs, May 28 2009 1:19 am
Subject: Re: How to translate a xpath in a BeautifulSoup tree

Hello,

Thanks for the reply.  I tried replacing all of that syntax throughout the
script, and afterwards I was getting this error instead when trying to do a
document.getFirstItem() XPath lookup.

  File "hello.py", line 8, in ?
    test.main()
  File "/home/hlcc/public_html/handl-scrape/test.py", line 18, in main
    html_wrapped = "<html>" + str(document.getFirstItem('//select[@name="Page"]')) + "</html>"
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2627, in getFirstItem
    elm=self.evaluate(expr,context,None,XPathResult.FIRST_ORDERED_NODE_TYPE,None).singleNodeValue
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2617, in evaluate
    return self.createExpression(expr,resolver).evaluate(context,type)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2606, in createExpression
    return XPathExpression(expr,resolver)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 2537, in __init__
    self.expr=BinaryExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1030, in parse
    expr=UnaryExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1087, in parse
    return UnionExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1132, in parse
    expr=PathExpr.parse(lexer)
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1246, in parse
    path=PathExpr(FilterExpr.root()) # RootExpr
  File "/home/hlcc/public_html/handl-scrape/BSXPath.py", line 1158, in __init__
    self.needContextPosition=filter.needContextPosition
AttributeError: 'FunctionCall' object has no attribute 'needContextPosition'

So that makes me think that the restructuring of the code doesn't exactly
work in the same way, or else I made a typo of some sort.

I also found this:
http://www.guyrutenberg.com/2007/10/12/conditional-expressions-in-pyt...

Could this be the correct way to do it in 2.4, do you think?

Thanks again,
Steve
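
One thing worth checking when replacing conditional expressions by hand: a common workaround from before Python 2.5 is the "cond and a or b" idiom. Whether the linked article recommends it is not assumed here. The sketch below, with hypothetical helper names and_or and ternary, shows that the idiom matches the conditional expression only while the "true" value is itself truthy, so it is safe for u'true' / u'false' but not as a blanket replacement.

```python
def and_or(cond, a, b):
    # Pre-2.5 idiom; returns b whenever a is falsy, even if cond is true.
    return cond and a or b


def ternary(cond, a, b):
    # Conditional expression (Python 2.5+), the behavior being imitated.
    return a if cond else b


# Safe case: the "true" value is a non-empty string.
print(and_or(True, u'true', u'false'))   # true
print(and_or(False, u'true', u'false'))  # false

# Unsafe case: the "true" value is an empty string, so the idiom falls through.
print(repr(ternary(True, u'', u'fallback')))  # ''
print(repr(and_or(True, u'', u'fallback')))   # 'fallback'
```

A plain if/else statement has neither problem, at the cost of a few more lines per replacement.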


 