Scrape data through XPath from a div that contains JavaScript in Scrapy (Python)


shiva krishna

Jun 12, 2012, 8:25:32 AM6/12/12
to scrapy...@googlegroups.com
I am working with Scrapy, scraping a site and using XPath to extract items.
But one of the `div` elements contains JavaScript: when my XPath expression goes down to the `div id` whose contents involve JavaScript, it returns an empty list, yet when I leave that div out of the expression I can fetch the HTML data fine.

HTML code

    <div class="subContent2">    
       <div id="contentDetails">
           <div class="eventDetails">
                <h2>
                    <a href="javascript:;" onclick="jdevents.getEvent(117032)">Some data</a>
                </h2>
           </div>
       </div>
    </div> 

Spider Code

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector


    class ExampleSpider(BaseSpider):
        name = "example"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/jkl/index.php"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            required_data = hxs.select('//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"]')

So how can I get the text ("Some data") from the anchor tag inside the `h2` element shown above? Is there an alternate way to fetch data from elements that contain JavaScript in Scrapy?

Максим Горковский

Jun 12, 2012, 10:36:42 AM6/12/12
to scrapy...@googlegroups.com
I think what you're looking for is:
//div[@class="subContent2"]//h2/a/text()
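A relative path like that skips the intermediate divs entirely. As a quick standalone check (a sketch using the standard library's limited XPath support, not Scrapy's selector API; the HTML fragment from the question happens to be well-formed XML, so `ElementTree` accepts it):

```python
# Sketch: show that XPath-style matching parses right past the
# JavaScript, which lives only inside the href/onclick attributes.
import xml.etree.ElementTree as ET

html = (
    '<div class="subContent2">'
    '<div id="contentDetails">'
    '<div class="eventDetails">'
    '<h2><a href="javascript:;" '
    'onclick="jdevents.getEvent(117032)">Some data</a></h2>'
    '</div></div></div>'
)

root = ET.fromstring(html)
# The script in the attributes does not stop the selector from
# matching the anchor element itself.
anchor = root.find('.//div[@class="eventDetails"]/h2/a')
print(anchor.text)
```

The same relative idea carries over to Scrapy's `//div[@class="subContent2"]//h2/a/text()`.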


--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To view this discussion on the web visit https://groups.google.com/d/msg/scrapy-users/-/iio2UO7O2OMJ.
To post to this group, send email to scrapy...@googlegroups.com.
To unsubscribe from this group, send email to scrapy-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.



--
Best regards,
Максим Горковский

Steven Almeroth

Jun 12, 2012, 12:09:55 PM6/12/12
to scrapy...@googlegroups.com
works for me:

>>> response.body
'    <div class="subContent2">    \n       <div id="contentDetails">\n           <div class="eventDetails">\n                <h2>\n                    <a href="javascript:;" onclick="jdevents.getEvent(117032)">Some data</a>\n                </h2>\n           </div>\n       </div>\n    </div> \n'

>>> hxs.select('//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"]')
[<HtmlXPathSelector xpath='//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"]' data=u'<div class="eventDetails">\n             '>]

>>> hxs.select('//div[@class="subContent2"]/div[@id="contentDetails"]/div[@class="eventDetails"]/h2/a/text()').extract()
[u'Some data'] 

shiva krishna

Jun 13, 2012, 1:51:59 AM6/13/12
to scrapy...@googlegroups.com
Thanks for your reply.

But how did it work for you? I tried your XPaths, but they are not working for me; they return an empty list. Does XPath not fetch data from tags that contain JavaScript? If so, what is the alternative for fetching data from those tags?

Thanks in advance




Steven Almeroth

Jun 13, 2012, 10:32:54 AM6/13/12
to scrapy...@googlegroups.com
Shiva, please post your interactive shell session output so we can see what you are doing. XPath, and thus Scrapy, has no problem parsing markup with JavaScript, CoffeeScript, or any other kind of script tucked into the tag attributes. Scrapy will not interpret the script, but it has no problem parsing around it. Your problem has to do with something else.

What happens when you do this:

print response.body

and this: 

hxs.select('//text()').extract()


shiva krishna

Jun 14, 2012, 1:57:20 AM6/14/12
to scrapy...@googlegroups.com
Hi Steven,

When I printed response.body, it displayed all the HTML content of the page, including the JavaScript.

When I printed hxs.select('//text()').extract(), it showed something like this:

[ u'\n\t  html { height: 100% }\n\t  body { height: 100%; margin: 0px; padding: 0px }\n\t  #mapCont { height: 100% }\n\t',
 u'\n\n\n\t',
 u'White Pages',
 u'\ntry {\nvar pageTracker = _gat._getTracker("UA-1220997-   2");\npageTracker._setDomainName(".example.com");\npageTracker._trackPageview();\n} catch(err) {}',
 u'\nvar gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");\ndocument.write(unescape("%3Cscript src=\'" + gaJsHost + "google-analytics.com/ga.js\' type=\'text/javascript\'%3E%3C/script%3E"));\n'...............]

But when I viewed the response in a browser using the command view(response), the browser showed the page without the HTML that the JavaScript would have generated.
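That last observation is the key: Scrapy downloads the raw HTML and never executes JavaScript, so content that the browser builds at runtime is simply absent from response.body. A common workaround is to pull the parameters out of the inline script or attributes and request the underlying endpoint directly. A minimal sketch, assuming the `onclick` value from the HTML in the original question; the event-detail URL below is hypothetical, so inspect the real AJAX request in your browser's network tab to find the actual endpoint:

```python
# Sketch: recover the event id that the page's JavaScript would pass
# to jdevents.getEvent(), then build a follow-up URL for it.
import re

# onclick value copied from the anchor tag in the original question.
onclick = "jdevents.getEvent(117032)"
match = re.search(r"getEvent\((\d+)\)", onclick)
event_id = match.group(1)

# Hypothetical endpoint -- the real one must be found by watching the
# browser's network traffic when the link is clicked.
detail_url = "http://www.example.com/jkl/event.php?id=%s" % event_id
print(event_id, detail_url)
```

A follow-up Request to that endpoint would then return the event details as plain HTML or JSON, which Scrapy can parse normally.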